rohankapur.com
Feb 16, 2016
. . .
Before I start exploring new algorithms/practices in Machine Learning
& Co., I want to first refine my current knowledge. Part of this
involves plugging gaps in content I either failed to understand or failed to research during my prior endeavors in Artificial Intelligence.
Today, I will be addressing the Normal Equation in a regression
context. I am not going to explore Machine Learning in depth (but
you can learn more about it in my previous two posts) so this article
assumes basic comprehension of supervised Machine Learning
regression algorithms. It also assumes some background in matrix calculus, but an intuition for calculus and for Linear Algebra separately will suffice.
As you can see, the blue line captures the trend (that is, how the data points move across the instance space) in this two-dimensional, noisy example. The term “residuals”, primarily used in Statistics and
rarely in Machine Learning, may be new for many of you. Residuals,
the vertical lines in gray, are the distances between each of the y-
coordinates of the data points and their respective predictions on the
model. At TV = 0, our model predicted a value of Sales = 6.5 even
though the actual value was Sales = 1 (I’m eye-balling this!). Hence,
the residual for this instance is 5.5.
The final step is to decide how we use a residual in the cost function (as the principal unit of the cost function). Simply summing over the residuals (which are mere differences/subtractions) is actually not suitable, because positive and negative residuals cancel each other out (the sum can even be negative), so it fails to capture the idea of Cartesian “distance” and would make our final cost inaccurate. A valid solution would be wrapping an absolute value
over our subtraction expression. Another approach is to square the
expression. The latter is generally preferred because of its
Mathematical convenience when performing differential Calculus.
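Concretely (writing \hat{y}^{(i)} as shorthand for the model’s prediction on the i-th example; notation introduced here just for illustration), the two candidate per-instance costs are:

\left| \hat{y}^{(i)} - y^{(i)} \right| \qquad \text{versus} \qquad \left( \hat{y}^{(i)} - y^{(i)} \right)^{2}

The squared form is differentiable everywhere, which is what makes the calculus below go smoothly.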
Now, we can plug in any value x for TV and expect to get an accurate projection for the number of Sales we plan to make. The
only issue is, well, we have to figure out the value(s) of Θ that form
an accurate model. As humans, we may be able to guess a few values
but a) this will not apply in higher-dimensional applications, b) a computer cannot do “guess-work”, and c) we want to find the most accurate Θ.
The residual is simply the difference between the actual value and the
predicted value (the input values fed through the model):
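As a sketch in standard notation (I’ll write the model’s prediction for an input x as h_\Theta(x); this shorthand is mine), the residual of the i-th training example is:

r^{(i)} = y^{(i)} - h_\Theta\left( x^{(i)} \right)

The sign convention does not matter much here, since the residual is about to be squared anyway.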
Earlier we discussed different ways we could interact with the residual to create a principal cost unit. The squared difference is the preferred one:
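One common way to write the resulting cost over all m training examples (the constant in front is conventional and does not change which \Theta minimizes it) is:

J(\Theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\Theta\left( x^{(i)} \right) - y^{(i)} \right)^{2}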
We square the residuals because of the Mathematical convenience when we use differential Calculus
Next:
Now let’s simplify the left vector:
And then:
And bam, we can clearly see that this simplifies to:
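In matrix form (a sketch of these steps: X stacks one training example per row, with a leading column of ones for the intercept, and y stacks the corresponding labels), the whole vector of predictions is simply X\Theta, so the sum of squared residuals collapses to:

J(\Theta) = \frac{1}{2m} \left( X\Theta - y \right)^{T} \left( X\Theta - y \right)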
So, now that we’re able to “score” Θ, the end goal is simple:
In other words, find the Θ that minimizes the output of the cost function and hence represents the most accurate model with the lowest average residual.
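In symbols (with \Theta^{*} denoting that minimizer):

\Theta^{*} = \arg\min_{\Theta} J(\Theta)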
So, how shall we approach this? I hope it will become clear through
this graph:
It’s convex!
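Because the cost is convex, we can find its minimum by differentiating and setting the derivative to zero. To get there, first expand the quadratic form (a sketch, dropping the constant scaling factor since it does not affect the minimizer):

\left( X\Theta - y \right)^{T} \left( X\Theta - y \right) = (X\Theta)^{T}(X\Theta) - (X\Theta)^{T} y - y^{T} (X\Theta) + y^{T} y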
Since y and XΘ are vectors of the same dimensionality, the product of the transpose of one with the other is just a scalar (a dot product), and that scalar is the same whichever of the two you transpose (note that this kind of swap is not valid for general matrix products). Hence, the two middle terms can be collected into one term:
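Which leaves (again a sketch):

(X\Theta)^{T}(X\Theta) - 2\, y^{T} (X\Theta) + y^{T} y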
Let’s try to simplify the remaining transposes (using the distribution identity (AB)ᵀ = BᵀAᵀ):
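Applying that identity to the first term, and dropping the now-redundant parentheses, gives (sketch):

\Theta^{T} X^{T} X \Theta - 2\, y^{T} X \Theta + y^{T} y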
We can remove the last term because, when differentiating with respect to Θ, its derivative is zero (since Θ is not involved):
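So what remains to differentiate is (sketch):

\nabla_{\Theta} J(\Theta) \;\propto\; \nabla_{\Theta} \left( \Theta^{T} X^{T} X \Theta - 2\, y^{T} X \Theta \right)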
Let us evaluate the left term first:
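The matrix-calculus rule in play here (stated for a symmetric matrix A; note that X^{T}X is indeed symmetric) is:

\nabla_{\Theta} \left( \Theta^{T} A \Theta \right) = 2 A \Theta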
EDIT: I want to add intuition for the rule here. Recall that a vector’s transpose multiplied by the vector itself gives a scalar: the sum of the squares of its entries (which is not the same thing as squaring the vector entry-wise). If we take the derivative of this scalar with respect to the original vector, we apply the basic principle of differentiating with respect to each entry of the vector separately:
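(A sketch, writing \theta for the vector and \theta_{j} for its j-th entry:)

\frac{\partial}{\partial \theta_{j}} \left( \theta_{1}^{2} + \theta_{2}^{2} + \cdots + \theta_{n}^{2} \right) = 2 \theta_{j}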
That simplifies to:
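(Collecting those per-entry derivatives back into a vector:)

\nabla_{\theta} \left( \theta^{T} \theta \right) = 2 \theta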
So now, using this identity, we can solve the derivative of the left term
to be:
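(With A = X^{T}X in the rule above:)

\nabla_{\Theta} \left( \Theta^{T} X^{T} X \Theta \right) = 2 X^{T} X \Theta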
EDIT: I hope the intuition behind this identity becomes clear using
the same principles in the previous explanation.
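For the right term, the corresponding rule gives (a sketch; y^{T} X \Theta is just a weighted sum of the entries of \Theta, so differentiating picks out the weights):

\nabla_{\Theta} \left( 2\, y^{T} X \Theta \right) = 2 X^{T} y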
Now we can combine the two terms together to formulate the final
derivative for the cost function:
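(Sketch, with the constant factor still dropped:)

\nabla_{\Theta} J(\Theta) \;\propto\; 2 X^{T} X \Theta - 2 X^{T} y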
Recall that the only turning point — a point where the instantaneous gradient is zero — of a convex function is its global minimum. So, we set the derivative to zero and hence find said minimum point:
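(A sketch of the rearrangement, assuming X^{T}X is invertible:)

2 X^{T} X \Theta - 2 X^{T} y = 0 \;\Longrightarrow\; X^{T} X \Theta = X^{T} y \;\Longrightarrow\; \left( X^{T} X \right)^{-1} X^{T} X \Theta = \left( X^{T} X \right)^{-1} X^{T} y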
By setting the derivative of the cost function to zero, we are finding the Θ that attains the lowest cost
A matrix multiplied by its own inverse is just the identity, so…:
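(The Normal Equation itself:)

\Theta = \left( X^{T} X \right)^{-1} X^{T} y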
We got it!
Here we have it — finally! This is our solution for Θ where the cost is
at its global minimum.
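If you want to sanity-check the result numerically, here is a minimal NumPy sketch (the synthetic TV/Sales numbers, coefficients, and variable names are made up purely for illustration):

import numpy as np

# Synthetic data roughly shaped like the TV vs. Sales example above.
rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=100)
sales = 7.0 + 0.05 * tv + rng.normal(0.0, 1.0, size=100)

# Design matrix X: a leading column of ones for the intercept, then TV.
X = np.column_stack([np.ones_like(tv), tv])
y = sales

# The Normal Equation: theta = (X^T X)^{-1} X^T y.
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# Solving the linear system X^T X theta = X^T y is numerically
# preferable to forming the inverse explicitly.
theta_stable = np.linalg.solve(X.T @ X, X.T @ y)

print(theta)         # roughly [7.0, 0.05]
print(theta_stable)

Both give the same coefficients; the second form is what you would normally use in practice, since X^{T}X can be ill-conditioned.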
. . .
I hope that was rewarding for you. It definitely was for me. This was
my first step into matrix calculus, which proved to be a stimulating
challenge. Next article will be even more daring: a full Mathematical
writeup on backpropagation! See you then!