The intuition of gradient descent

Modern deep learning methods are built on top of this mathematical method, but what’s its underlying intuition? Let me try to explain it in my own words.

First, let us set up the problem: figuring out an equation’s parameter from some examples. For example, imagine that our equation is

 $y = Ax$

Here $A$ is the parameter to be figured out, and $y, x$ are variables. The examples we have are pairs of specific $y, x$ values, like $y=5, x=1$ or $y=10, x=2$. To keep things simple, imagine that we have just one example, $y=5, x=1$, and let us figure out what $A$ is.

A human may say $A = \frac{y}{x}$, so with our example, $A = \frac{5}{1} = 5$. End of story.

But our computer is not that smart, so it tries to solve the problem like this:

1. Start from a guess of $A$, like $A = 0$.

2. Compute $y$ with $x = 1$ and $A = 0$, which gives $y = 0$.

3. Compare the calculated $y = 0$ with the real example $y = 5$ and find that the guess is wrong, so go back to step 1 and make another guess. Repeat until the guess feels good enough, then stop.
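The three steps above can be sketched in a few lines of Python. This is only an illustration: the search range of -10 to 10, the tolerance of 0.01 and the function name `guess_and_check` are my own assumptions, not part of the method itself.

```python
import random

def guess_and_check(x_e, y_e, tolerance=0.01, max_tries=100_000, seed=0):
    """Blindly guess A until A * x_e is close enough to y_e."""
    rng = random.Random(seed)          # fixed seed so runs are repeatable
    a = 0.0                            # step 1: start from a guess, A = 0
    for tries in range(1, max_tries + 1):
        y = a * x_e                    # step 2: compute y with the current guess
        if abs(y - y_e) <= tolerance:  # step 3: compare with the example
            return a, tries
        # Not good enough: make another guess (assumed range: -10 to 10).
        a = rng.uniform(-10.0, 10.0)
    raise RuntimeError("no good guess found")

a, tries = guess_and_check(x_e=1.0, y_e=5.0)
print(f"A is about {a:.3f} after {tries} tries")
```

Notice that the loop never looks at the form of the equation beyond computing $y$ from a guess, which is why it wastes so many tries.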

We may say that this is a stupid method! But to a computer it seems like an OK one, because it does not require any knowledge of the equation: no matter whether our equation is as simple as the one above or a very complex one like $y = A(x^5 + \sin(x) + \sqrt{x^9})$, it will somehow work. So let us develop this method further.

The first thing is of course to figure out what "good enough" means: it can be defined as being close enough to the example value. Specifically, we can subtract the calculated $y$ from the example $y$, giving $y_e - Ax_e$ (where $y_e$ and $x_e$ are the example), and if this value is small enough then we consider the guess good. To avoid dealing with positive and negative values, we can square it, $(y_e - Ax_e)^2$, so it is never negative, and the smaller the better.
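As a small sketch, the "how wrong is this guess" measure can be written as a function. The name `squared_error` and the sample guesses are my own choices for illustration.

```python
def squared_error(a, x_e, y_e):
    """How far the guess A is from explaining the example (x_e, y_e)."""
    return (y_e - a * x_e) ** 2

x_e, y_e = 1.0, 5.0
for a in (0.0, 4.0, 5.0, 6.0):
    print(a, squared_error(a, x_e, y_e))
```

With the example $y=5, x=1$, the guesses 4 and 6 both give an error of 1, the guess 0 gives 25, and the guess 5 gives 0, so overshooting and undershooting are punished equally.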

The next important thing is to figure out how to make the next guess when the current one is not satisfying. Sometimes random guessing works, but is there a smarter way? To figure this out, we can try to visualize the problem by drawing some curves.

parabolic curve of loss

In the drawing above, we guessed 3 times, with A1, A2 and A3, and plotted each guess as a point in the coordinate plane, with the guess on the horizontal axis and its squared error on the vertical axis. Because we are lucky, we can almost see what the curve looks like: it is a parabola opening upwards.

In fact, the parabola is described by $f(A) = (y_e-Ax_e)^2$, using math we learned in high school. Notice that $A$ is now the variable, because $y_e, x_e$ are both constants.

This drawing inspires us: maybe we can guide the guessing with this parabola in mind.
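As a quick check, expanding the square puts $f(A)$ into the usual high-school form of a parabola. The second line assumes the single example $y_e = 5, x_e = 1$ from earlier:

 $f(A) = (y_e - Ax_e)^2 = x_e^2A^2 - 2x_ey_eA + y_e^2$

 $f(A) = A^2 - 10A + 25 = (A - 5)^2$

The coefficient of $A^2$ is $x_e^2$, which is positive whenever $x_e \neq 0$, so the curve opens upwards, and its lowest point sits at $A = 5$, the same answer a human gets by division.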

slope is our best guess

Now assume we are not that lucky: we have only tried A1 and want to figure out the best choice for A2. At the point of A1, we think that following the curve would be the best thing to do. In math, following the curve means following the slope at that point (the red line), so we draw the tangent line and extend it until it crosses the horizontal axis. The crossing point on the horizontal axis is the A2 we want.

To figure out A2, we mark the signed distance from A1 to A2 as $\theta$ (so $\theta = A_2 - A_1$), and we notice that

 $\frac{-(y_e-A_1x_e)^2}{\theta} = \frac{0 - (y_e-A_1x_e)^2}{A_2-A_1} = \frac{d(y_e-Ax_e)^2}{dA}(A_1)$

and because

 $\frac{d(y_e-Ax_e)^2}{dA} = \frac{d(y_e^2-2y_ex_eA+x_e^2A^2)}{dA} = -2y_ex_e+2x_e^2A$

we can solve for $\theta$ as

 $\theta = -(y_e-A_1x_e)^2 \cdot (\frac{d(y_e-Ax_e)^2}{dA}(A_1))^{-1}$

 $= -(y_e-A_1x_e)^2 \cdot (-2y_ex_e+2x_e^2A_1)^{-1}$

Looking at the right side, $y_e, A_1, x_e$ are all known, so we can calculate $\theta$ at the end of step 3. Our next guess is then $A_2 = A_1 + \theta$.
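Here is a minimal sketch of this tangent-line step, assuming the example $y_e = 5, x_e = 1$ and a first guess of $A_1 = 0$. The slope is computed as $-2y_ex_e + 2x_e^2A_1$, the derivative of the squared error at $A_1$, and the function name `tangent_step` is my own.

```python
def tangent_step(a1, x_e, y_e):
    """Return theta, the step from A1 to where the tangent line reaches zero."""
    loss = (y_e - a1 * x_e) ** 2              # height of the curve at A1
    slope = -2 * y_e * x_e + 2 * x_e**2 * a1  # derivative of the loss at A1
    # The division fails when slope == 0, i.e. when A1 is already the minimum.
    return -loss / slope

x_e, y_e = 1.0, 5.0
a = 0.0
for _ in range(5):
    a = a + tangent_step(a, x_e, y_e)  # A2 = A1 + theta
    print(a)
```

For this example the guesses go 2.5, 3.75, 4.375, 4.6875, 4.84375: each step halves the remaining distance to $A = 5$.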

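A small problem with the above is that we calculated the next guess $A_2$ from the full value of $(y_e-A_1x_e)^2$, jumping all the way to where the tangent line reaches zero. That is usually a bad idea in practice because it is too confident: the tangent line only matches the curve near $A_1$. So usually we take a more cautious step. We trust the slope only for the direction (move against its sign) and scale it by a small positive value $r$ (called the learning rate), and with that

 $\theta = -r \cdot (-2y_ex_e+2x_e^2A_1)$

 $A_2 = A_1 + \theta = A_1 -r \cdot (-2y_ex_e+2x_e^2A_1)$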
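Repeating this update gives the whole method. The sketch below assumes the example $y_e = 5, x_e = 1$, a starting guess of 0, a learning rate of $r = 0.1$ and 50 repetitions. All of these values, and the name `gradient_descent`, are illustrative choices of mine.

```python
def gradient_descent(x_e, y_e, a=0.0, r=0.1, steps=50):
    """Repeat A <- A - r * slope, starting from the guess a."""
    for _ in range(steps):
        slope = -2 * y_e * x_e + 2 * x_e**2 * a  # derivative of (y_e - A*x_e)^2
        a = a - r * slope                        # small step against the slope
    return a

a = gradient_descent(x_e=1.0, y_e=5.0)
print(a)  # very close to 5
```

With these settings every step removes a fifth of the remaining error, so the guess creeps toward 5 without ever overshooting. If $r$ is chosen too large (above 1 for this particular example), each step overshoots by more than it corrects and the guesses move away from 5 instead.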

Above is the intuition behind gradient descent. The important ideas are:

The stupid guess-and-iterate method. Remember this method, because over time we will see that it may be the most important method in all of science.

Drawing the curve and figuring out which quantity is the variable and which is not. This is really the key step, as it is the starting point for everything that follows.