Proving the Gradient Descent Exercise (Neural Networks and Deep Learning)

Chapter 1 of Neural Networks and Deep Learning by Michael Nielsen presents the following question: how should we choose a vector of changes \Delta{v}, of fixed magnitude, in order to minimize \Delta{C}\approx\nabla{C}\cdot\Delta{v}? To use the book's analogy, suppose you are standing at the top of a hill and want the fastest way down. In which direction should you take a step of fixed length so that it brings you toward the bottom of the hill the quickest?
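
Before the proof, here is a minimal numerical sketch of the approximation itself, assuming a toy cost C(v) = v_1^2 + v_2^2 (my choice for illustration, not from the book): for a small change \Delta{v}, the true change in C closely matches \nabla{C}\cdot\Delta{v}.

```python
import numpy as np

# Toy cost function, chosen only for illustration: C(v) = v1^2 + v2^2,
# so the gradient is grad C(v) = (2*v1, 2*v2).
def C(v):
    return np.sum(v ** 2)

def grad_C(v):
    return 2 * v

v = np.array([1.0, 2.0])
delta_v = np.array([0.01, -0.02])       # a small change in v

actual = C(v + delta_v) - C(v)          # the true Delta C
approx = grad_C(v) @ delta_v            # the first-order estimate grad C . Delta v

print(actual, approx)                   # -0.0595 vs. -0.06: the approximation is close
```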

Using the Cauchy-Schwarz inequality:

\lvert\nabla{C}\cdot\Delta{v}\rvert \leq \lvert\lvert\nabla{C}\rvert\rvert\lvert\lvert\Delta{v}\rvert\rvert

Cauchy-Schwarz tells us that \nabla{C}\cdot\Delta{v} can be no smaller than -\lvert\lvert\nabla{C}\rvert\rvert\lvert\lvert\Delta{v}\rvert\rvert, and this lower bound is attained exactly when the two vectors are linearly dependent. So, in the interest of minimization, take \Delta{v} linearly dependent with \nabla{C}. Then, by the equality case of Cauchy-Schwarz,

\nabla{C}\cdot\Delta{v} = -\lvert\lvert\nabla{C}\rvert\rvert\lvert\lvert\Delta{v}\rvert\rvert

From the linear dependence we know \Delta{v}=t\nabla{C} for some constant t. Since the dot product of the two vectors is negative, they “point” in opposite directions, so t<0, and we can write t=-\eta with \eta>0. The fixed step size \lvert\lvert\Delta{v}\rvert\rvert=\epsilon then pins down \eta=\epsilon/\lvert\lvert\nabla{C}\rvert\rvert. So the choice that decreases our cost function the fastest for a fixed step of size \epsilon is \Delta{v}=-\eta\nabla{C} for a small, positive \eta.
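
As a sanity check, here is a small sketch (again using the toy cost C(v) = v_1^2 + v_2^2 assumed above) comparing the step \Delta{v}=-\eta\nabla{C}, with \eta=\epsilon/\lvert\lvert\nabla{C}\rvert\rvert, against many random steps of the same length \epsilon; the gradient-descent step should decrease C at least as much as every one of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Same toy cost as before (an illustrative assumption): C(v) = v1^2 + v2^2.
def C(v):
    return np.sum(v ** 2)

def grad_C(v):
    return 2 * v

v = np.array([1.0, 2.0])
eps = 0.01                               # fixed step size ||Delta v|| = eps
g = grad_C(v)

# The gradient-descent step: Delta v = -eta * grad C with eta = eps / ||grad C||,
# so that ||Delta v|| = eps as required.
gd_step = -(eps / np.linalg.norm(g)) * g

# Many random directions, rescaled so every candidate step also has length eps.
dirs = rng.normal(size=(1000, 2))
random_steps = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

gd_change = C(v + gd_step) - C(v)
best_random = min(C(v + s) - C(v) for s in random_steps)

# The gradient-descent step decreases C at least as much as the best random step.
print(gd_change, best_random, gd_change <= best_random)
```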
