Proving the Gradient Descent Exercise (Neural Networks and Deep Learning)

Chapter 1 of Neural Networks and Deep Learning by Michael Nielsen presents the following question: how should we choose a vector of changes \Delta{v}, of fixed magnitude, in order to minimize \Delta{C}\approx\nabla{C}\cdot\Delta{v}? To use the book's analogy, suppose you are standing at the top of a hill and want the fastest way down. In which direction should you take a step of fixed length so that it brings you toward the bottom of the hill the quickest?
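
Before the proof, here is a minimal numerical sketch of the approximation itself, assuming a toy cost C(v) = v_1^2 + v_2^2 (my choice for illustration, not from the book): for a small change \Delta{v}, the true change in C closely matches \nabla{C}\cdot\Delta{v}.

```python
import numpy as np

# Toy cost function, chosen only for illustration: C(v) = v1^2 + v2^2,
# so the gradient is grad C(v) = (2*v1, 2*v2).
def C(v):
    return np.sum(v ** 2)

def grad_C(v):
    return 2 * v

v = np.array([1.0, 2.0])
delta_v = np.array([0.01, -0.02])       # a small change in v

actual = C(v + delta_v) - C(v)          # the true Delta C
approx = grad_C(v) @ delta_v            # the first-order estimate grad C . Delta v

print(actual, approx)                   # -0.0595 vs. -0.06: the approximation is close
```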

Using the Cauchy-Schwarz inequality:

\lvert\nabla{C}\cdot\Delta{v}\rvert \leq \lvert\lvert\nabla{C}\rvert\rvert\lvert\lvert\Delta{v}\rvert\rvert

Cauchy-Schwarz tells us that \nabla{C}\cdot\Delta{v} can be no smaller than -\lvert\lvert\nabla{C}\rvert\rvert\lvert\lvert\Delta{v}\rvert\rvert, and this lower bound is attained exactly when the two vectors are linearly dependent. So, in the interest of minimization, take \Delta{v} linearly dependent with \nabla{C}. Then, by the equality case of Cauchy-Schwarz,

\nabla{C}\cdot\Delta{v} = -\lvert\lvert\nabla{C}\rvert\rvert\lvert\lvert\Delta{v}\rvert\rvert

From the linear dependence we know \Delta{v}=t\nabla{C} for some constant t. Since the dot product of the two vectors is negative, they “point” in opposite directions, so t<0, and we can write t=-\eta with \eta>0. The fixed step size \lvert\lvert\Delta{v}\rvert\rvert=\epsilon then pins down \eta=\epsilon/\lvert\lvert\nabla{C}\rvert\rvert. So the choice that decreases our cost function the fastest for a fixed step of size \epsilon is \Delta{v}=-\eta\nabla{C} for a small, positive \eta.
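
As a sanity check, here is a small sketch (again using the toy cost C(v) = v_1^2 + v_2^2 assumed above) comparing the step \Delta{v}=-\eta\nabla{C}, with \eta=\epsilon/\lvert\lvert\nabla{C}\rvert\rvert, against many random steps of the same length \epsilon; the gradient-descent step should decrease C at least as much as every one of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy cost as before (an illustrative assumption): C(v) = v1^2 + v2^2.
def C(v):
    return np.sum(v ** 2)

def grad_C(v):
    return 2 * v

v = np.array([1.0, 2.0])
eps = 0.01                               # fixed step size ||Delta v|| = eps
g = grad_C(v)

# The gradient-descent step: Delta v = -eta * grad C with eta = eps / ||grad C||,
# so that ||Delta v|| = eps as required.
gd_step = -(eps / np.linalg.norm(g)) * g

# Many random directions, rescaled so every candidate step also has length eps.
dirs = rng.normal(size=(1000, 2))
random_steps = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

gd_change = C(v + gd_step) - C(v)
best_random = min(C(v + s) - C(v) for s in random_steps)

# The gradient-descent step decreases C at least as much as the best random step.
print(gd_change, best_random, gd_change <= best_random)
```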
