
Rohan Kapur

rohankapur.com
Feb 16, 2016 · 11 min read


Rohan #3: Deriving the Normal Equation using matrix calculus

Understanding the analytical solution to linear regression.

. . .

This is the third entry in my journey to extend my knowledge of Artificial Intelligence in the year of 2016. Learn more about my motives in this introduction post.

. . .
Before I start exploring new algorithms/practices in Machine Learning & Co., I want to first refine my current knowledge. Part of this involves plugging gaps in content I either failed to understand or failed to research during my prior endeavors in Artificial Intelligence. Today, I will be addressing the Normal Equation in a regression context. I am not going to explore Machine Learning in depth (but you can learn more about it in my previous two posts), so this article assumes basic comprehension of supervised Machine Learning regression algorithms. It also assumes some background in matrix calculus, but an intuition for calculus and for Linear Algebra separately will suffice.

Today, we try to derive and understand this identity/equation:

Θ = (XᵀX)⁻¹Xᵀy

Looks daunting? Keep reading!

Recall that a supervised Machine Learning regression algorithm examines a set of data points in a Cartesian instance space and tries to define a model that “fits” said data set with high accuracy. In this context, a model is any linear or non-linear function.

Data points in red, residuals in gray, model in blue — http://www.pieceofshijiabian.com/wp-content/uploads/2014/01/Screen-Shot-2014-01-18-at-5.50.07-PM.png

As you can see, the blue line captures the trend (that is, how the data points move across the instance space) in this two-dimensional, noisy example. The term “residuals”, primarily used in Statistics and rarely in Machine Learning, may be new to many of you. Residuals, the vertical lines in gray, are the distances between each of the y-coordinates of the data points and their respective predictions on the model. At TV = 0, our model predicted a value of Sales = 6.5 even though the actual value was Sales = 1 (I'm eye-balling this!). Hence, the residual for this instance is 5.5.

We would intuitively suggest that a smaller residual is preferred over a greater residual; however, a model may not be able (and, to avoid overfitting, we may not want it) to accommodate a change in its output for one data point while keeping its output constant for all other data points. So, instead, we say that we want to minimize the average residual: the summation of each data point's residual divided by the number of data points. We denote this the “cost function” for our model, and the output of said function the “cost”.

The final step is to decide how we use a residual in the cost function (the principal unit of the cost function). Simply summing over the residuals (which are mere differences/subtractions) is not suitable, because it may lead to negative values and fails to capture the idea of Cartesian “distance” (which would cause our final cost to be inaccurate). A valid solution would be wrapping an absolute value around our subtraction expression. Another approach is to square the expression. The latter is generally preferred because of its Mathematical convenience when performing differential Calculus.
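To make this concrete, here is a small illustrative sketch (plain Python, with made-up residuals) showing why raw differences make a poor cost: positive and negative residuals can cancel out, while absolute and squared residuals cannot.

```python
# Hypothetical residuals: one prediction is off by +3, another by -3.
residuals = [3.0, -3.0]

raw_cost = sum(residuals)                  # 0.0 -- looks "perfect", yet both predictions are wrong
abs_cost = sum(abs(r) for r in residuals)  # 6.0 -- captures the total distance
sq_cost = sum(r ** 2 for r in residuals)   # 18.0 -- also positive, and differentiable everywhere

print(raw_cost, abs_cost, sq_cost)
```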

OK. Let's approach this Mathematically. First, we will define our model as:

h(x) = Θ₀ + Θ₁x₁ + Θ₂x₂ + … + Θₙxₙ

This is pretty much y = mx + c, but in a higher-dimensional space such that x is a vector of values. The TV sales scatter plot diagram above is two-dimensional and hence has a single x: TV; however, this need not always be the case.

Notice that this is the “dot product” between the vectors Θ and x. We can rewrite this using the conveniences of Linear Algebra notation:

h(x) = Θᵀx

The “transpose” operation (which looks like a value raised to the power of “T”) switches the rows and columns of any matrix. Now we can get a legal multiplication between the vectors Θ and x.

Θ stores the real coefficients (or “weights”) of the input features x and is hence of the exact same dimensionality as x. It is important to note that we prefix the vector x with 1 so that we can add support for a constant term in our model.
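As a quick sketch of that idea (using NumPy, with made-up weights and a made-up TV value), the hypothesis is just a dot product once a leading 1 has been prepended to x:

```python
import numpy as np

theta = np.array([6.5, 0.05])   # hypothetical weights: intercept term and TV coefficient
x = np.array([1.0, 100.0])      # input prefixed with 1, followed by TV = 100

prediction = theta.T @ x        # the dot product Θᵀx
print(prediction)               # 11.5 under these made-up numbers
```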

Now, we can plug in any value x for TV and expect to get an accurate projection for the number of Sales we plan to make. The only issue is, well, we have to figure out the value(s) of Θ that form an accurate model. As humans, we may be able to guess a few values, but a) this will not apply in higher-dimensional applications, b) a computer cannot do “guess-work”, and c) we want to find the most accurate Θ.

x represents any single data point we enter to receive a projection. We can now define X as a collection of all the inputs for the A-Priori data points that we use to shape our model. Note that each data point is a vector in itself, so X is a large matrix (denoted the “design matrix”) whereby each row (we use the notation X with superscript arbitrary index i) is one vector x. y is, correspondingly, the vector of outputs for these A-Priori data points. We define m as the number of rows in X, or otherwise the number of A-Priori data points. In Machine Learning this is called a “training set”.
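A minimal sketch of building such a design matrix (NumPy, with a tiny made-up training set of three points and a single TV feature):

```python
import numpy as np

tv = np.array([0.0, 50.0, 100.0])    # hypothetical TV values for m = 3 training points
sales = np.array([1.0, 9.0, 11.0])   # their observed Sales values (made up)

# Design matrix X: one row per training point, prefixed with a column of 1s
# so that Θ can include the constant (intercept) term.
X = np.column_stack([np.ones_like(tv), tv])
y = sales
m = X.shape[0]

print(X.shape)   # (3, 2): m = 3 rows, one column of 1s plus one feature
```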

I hope it is clear that, here, Θ essentially defines the model. Everything else stays constant. So, to find the most accurate model, we need to find the Θ with the lowest cost (the lowest average residual, as stated earlier). Let's put our cost function in Mathematical notation:

Cost(Θ) = (1/m) Σ residual⁽ⁱ⁾   (summing over all m training points)

The residual is simply the difference between the actual value and the predicted value (the input values fed through the model):

residual⁽ⁱ⁾ = y⁽ⁱ⁾ − Θᵀx⁽ⁱ⁾

Earlier we discussed different ways we could interact with the residual to create a principal cost unit. The squared difference is the preferred one:

(y⁽ⁱ⁾ − Θᵀx⁽ⁱ⁾)²

We square the residuals because of the Mathematical convenience when we use differential Calculus

Inserting this expression into the cost function leaves us with:

Cost(Θ) = (1/2m) Σ (y⁽ⁱ⁾ − Θᵀx⁽ⁱ⁾)²

Notice how we've multiplied the overall expression by a half. We do this, as per usual, for Mathematical convenience.

Unfortunately, before we can proceed, we will need to represent the cost function in vectorized (Linear Algebra) form. Instead of explicitly summing over the squared residuals, we can compute them with matrix operations.

If we were to collect all the residuals in one vector, each row would hold the residual of one training point. Since every residual gets squared anyway, its sign does not matter, so we may just as well stack Θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾ for each i. We can further split this up: the vector of predictions is simply every row of X multiplied by Θ, which is XΘ, and the vector of actual outputs is y. And bam, we can clearly see that the residual vector simplifies to:

XΘ − y

But we're forgetting something important: each residual is squared. This poses a problem because we cannot simply square the overall expression XΘ − y. Why? Put simply, the square of a vector/matrix is not equal to the square of each of its inner values. So — what do we actually want to do? We want to take each value in the vector XΘ − y and multiply it by itself.

If we want each value in XΘ − y to be multiplied by itself, we can use a second copy of XΘ − y and have each value in the first vector multiply the value at the same index in the second vector. This is a dot product. Recall that when we formulated our hypothesis/model function, we took the dot product of two vectors as the multiplication of one vector transposed with the other vector. Hence, to achieve squared residuals, we end up with the following expression:

(XΘ − y)ᵀ(XΘ − y)

By adding the division term to the front of the expression, we have the final Linear Algebra/vectorized expression for our cost function:

Cost(Θ) = (1/2m) (XΘ − y)ᵀ(XΘ − y)
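As a quick sanity check (a sketch with the made-up numbers from above and an arbitrary candidate Θ), the explicit sum of squared residuals and the vectorized expression agree:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 50.0], [1.0, 100.0]])  # design matrix with a column of 1s
y = np.array([1.0, 9.0, 11.0])                          # made-up outputs
theta = np.array([2.0, 0.1])                            # arbitrary candidate Θ
m = X.shape[0]

loop_cost = sum((X[i] @ theta - y[i]) ** 2 for i in range(m)) / (2 * m)

r = X @ theta - y                                       # the residual vector XΘ − y
vectorized_cost = (r.T @ r) / (2 * m)

print(np.isclose(loop_cost, vectorized_cost))           # True
```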

So, now that we're able to “score” Θ, the end goal is simple:

minimize Cost(Θ) with respect to Θ

In other words, find the Θ that minimizes the output of the cost function and hence represents the most accurate model with the lowest average residual.

So, how shall we approach this? I hope it will become clear through this graph:

It's convex!

That's right — in Linear Regression one can prove that Cost(Θ) is a convex function! So, at this point, it seems pretty simple: apply differential Calculus to solve for the least cost. Let's get started!

We are going to differentiate with respect to, as expected, Θ. At this point, we can drop the 1/2m division term, because when we equate the derivative to zero it will be nulled out anyway.

We are yet again going to simplify the cost function (the RHS), this time using the matrix transpose distribution identity, specifically (AB)ᵀ ≡ BᵀAᵀ.

EDIT: I make a mistake here and do (AB)ᵀ = AᵀBᵀ, so I end up having to rearrange (which would otherwise be illegal due to the non-commutative nature of matrix multiplication). This does not matter with respect to determining the final derivative.

At this point, we can expand the two brackets:

ΘᵀXᵀXΘ − ΘᵀXᵀy − yᵀXΘ + yᵀy

Since the vectors y and XΘ are of the same dimensionality, the product of the transpose of one with the other is a scalar, and scalars satisfy the commutative property (which states that a · b = b · a — if you weren't aware, this property does not hold for matrices in general). Hence, the two middle terms can be collected into one term:

ΘᵀXᵀXΘ − 2yᵀXΘ + yᵀy
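A quick numeric check of that scalar-commutativity claim, with small made-up matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
theta = rng.normal(size=3)

# Both products are scalars, so they are equal even though
# matrix multiplication is not commutative in general.
print(np.isclose(y.T @ X @ theta, theta.T @ X.T @ y))   # True
```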

Let's tidy up the remaining transposes (again, using the distribution identities) and remove the last term, yᵀy: when differentiating with respect to Θ, it has no effect and its derivative is zero (since the term does not involve Θ). That leaves us to differentiate:

∂/∂Θ [ΘᵀXᵀXΘ − 2yᵀXΘ]

Let us evaluate the left term first:

∂/∂Θ [ΘᵀXᵀXΘ]

Note that the X terms are simply constants in this partial derivative, so we can “take them out” (using differentiation rules) like a coefficient. We then use the following matrix calculus scalar-by-vector identity:

∂(xᵀx)/∂x = 2x

EDIT: I want to add intuition for the rule here. Recall that the transpose of a vector multiplied by the vector itself gives the sum of the squares of its values, which is not the same as the square of the vector itself. If we take the derivative of this quantity with respect to the original vector, we apply the basic principle which states that we differentiate with respect to each value in the vector separately:

∂(x₁² + x₂² + … + xₙ²)/∂xⱼ = 2xⱼ

That simplifies to the identity above. So now, using this identity, we can solve the derivative of the left term to be:

∂/∂Θ [ΘᵀXᵀXΘ] = 2XᵀXΘ

Let's move on. The right term is much simpler to solve. Taking the scalars out:

∂/∂Θ [2yᵀXΘ] = 2 · ∂/∂Θ [yᵀXΘ]

Using the following matrix calculus conjugate transpose rule:

∂(aᵀx)/∂x = a

EDIT: I hope the intuition behind this identity becomes clear using the same principles as in the previous explanation.

We can see that (with a = Xᵀy, since yᵀXΘ = (Xᵀy)ᵀΘ):

∂/∂Θ [2yᵀXΘ] = 2Xᵀy
Now we can combine the two terms together to formulate the final derivative for the cost function:

∂Cost(Θ)/∂Θ = 2XᵀXΘ − 2Xᵀy

Again — this excludes the 1/2m division term
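As a sanity check on that result, here is a sketch (NumPy, made-up data) comparing the analytic gradient 2XᵀXΘ − 2Xᵀy against a finite-difference approximation of the un-scaled cost (XΘ − y)ᵀ(XΘ − y):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

def cost(t):
    r = X @ t - y
    return r.T @ r                      # (XΘ − y)ᵀ(XΘ − y), without the 1/2m factor

analytic = 2 * X.T @ X @ theta - 2 * X.T @ y

# Central-difference approximation of the gradient, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e) - cost(theta - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(analytic, numeric))   # True, up to floating-point error
```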

Recall that the only turning point — a point where the instantaneous gradient is zero — of a convex function is the global minimum of the function. So, we set the derivative to zero and hence find said minimum point:

∂Cost(Θ)/∂Θ = 0

By setting the derivative of the cost function to zero, we are finding the Θ that attains the lowest cost

Now, we sub in our computed derivative:

2XᵀXΘ − 2Xᵀy = 0

With some basic algebra, we're on our way!:

XᵀXΘ = Xᵀy

Slow down! Since matrix multiplication is not commutative, we can't isolate Θ by simply “bringing down” the X terms. Instead, we multiply both sides by the inverse of XᵀX (remember that division is the same as multiplication by the inverse):

(XᵀX)⁻¹XᵀXΘ = (XᵀX)⁻¹Xᵀy

The inverse of any matrix multiplied by the matrix itself is just the identity, so…:

Θ = (XᵀX)⁻¹Xᵀy

We got it!

Here we have it — finally! This is our solution for Θ, where the cost is at its global minimum.
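Putting it all together, a minimal sketch (NumPy, with made-up data) that computes Θ via the Normal Equation and cross-checks it against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 4
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])  # design matrix with intercept column
true_theta = np.array([6.5, 0.05, -1.2, 0.8])                   # made-up "true" weights
y = X @ true_theta + 0.1 * rng.normal(size=m)                   # noisy outputs

# Normal Equation: Θ = (XᵀX)⁻¹ Xᵀ y
theta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Cross-check with a numerically safer least-squares routine
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))   # True
```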

. . .

Elephant in the room — why isn't this preferred over Gradient Descent‽ What an elegant solution this is, you say. Well, here's the main reason why: computing the seemingly harmless inverse of the n by n matrix XᵀX is, with the algorithms used in practice, of roughly cubic time complexity. This means that as the number of features n increases, the number of operations required to compute the final result grows in a cubic trend. If X is rather small, and especially if n is low (that is, X isn't of high dimension), then using the Normal Equation is feasible. But for any industrial application with large datasets, the Normal Equation would take an extremely — sometimes nonsensically — long time.
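Even when the Normal Equation is feasible, libraries generally avoid forming the explicit inverse. A hedged sketch of the more numerically stable route (same result, still roughly cubic in the number of features):

```python
import numpy as np

def normal_equation(X, y):
    """Solve XᵀXΘ = Xᵀy directly rather than computing (XᵀX)⁻¹ explicitly.

    np.linalg.solve is faster and more numerically stable than multiplying
    by an explicit inverse, although the asymptotic cost is still roughly
    cubic in the number of features n.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```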

I hope that was rewarding for you. It definitely was for me. This was
my first step into matrix calculus, which proved to be a stimulating
challenge. Next article will be even more daring: a full Mathematical
writeup on backpropagation! See you then!
