Finding direction of steepest ascent

During backpropagation, how do we decide how to update the weights in the neural network?

The fundamental method was introduced¹ on 18 October 1847 by Augustin-Louis Cauchy (1789–1857) in his Méthode générale pour la résolution des systèmes d’équations simultanées². Jacques Salomon Hadamard (1865–1963) gave³ it the name “method of descent”.

Improving weights

After we have decided how often we plan to update the weights in each epoch, how do we actually update the weights? There are several approaches for calculating this, and they are all based on the optimisation of an objective function.

So, what is an objective function? To answer this question, let us first answer this: what are we trying to achieve with training? Our goal with training is to acquire an ability to predict certain properties of a sample, knowing that the sample comes from a distribution that we learned our model from. Say we already knew the correct parameters for the model, i.e., the required weights, we can simply use these weights to evaluate a given sample and achieve our goal immediately, as this evaluation will provide the correct prediction. There will be no need for training.

Unfortunately, we do not know a priori the model parameters. Hence, to differentiate the various weights that is possible, we must figure out a way of evaluating each weight against our goal. This is what an objective function provides—a mechanism for us to evaluate the weights against our goal for learning. For instance, let’s say we are learning to classify an item from a population of such items (i.e., same distribution) into either “good” or “bad”. We start with a random sample of items from this population, which are then categorised into “good” or “bad” following the criteria set out by our goal. These items and their corresponding label together define the training set. Hence, a simple objective function that we could use during batch update is a function that calculates the number of failed classifications during each epoch, where by “failure” we mean prediction that does not match supplied label for that item in the training set. It is easy to see that good weights will have fewer incorrect classifications as compared to bad weights. Thus this simple function allows us to compare between weights.

Are all functions that allow us to evaluate weights against a training set always a good objective function for updating weights? For instance, is the objective function as defined in our example classification a good objective function? Does it allow us to update a given set of weights so that they get better? Unfortunately, our example objective function is not a good objective function for training. Why? Because, just by knowing the number of misclassifications, there is no way for us to update the weights so that the number of misclassifications reduces, thus improving the performance of the weights. This brings us to one important property required in a good objective function—it should allow us to update the weights. In other words, the objective function must be a function of not only the labels (predicted and supplied), but also the weights that were used to carry out the prediction. More formally, if we denote the model parameters as \(\theta\) with \(p\) weights, and the predicted and supplied sample properties respectively as \(\hat{y}\) and \(y\), then an objective function \(f\) should take the form \(f(\hat{y}, y, \theta)\), where \(\hat{y}, y\in\mathbb{R}\) and \(\theta \in \mathbb{R}^p\).

Why is this important? And why don’t we make the objective function a function of the sample features as well? For instance, let’s say each sample has \(d\) features in its feature set \(F\), which is combined with \(\theta\) to produce \(\hat{y}\), shouldn’t we therefore include \(F\) when defining the objective function \(f\)? In other words, should the objective function be \(f(\hat{y}, y, \theta, F)\)? This is an extremely important question. The answer is yes, but only if the \(d\) features in \(F\) are treated as constants inside \(f\) when calculating the required weight corrections. This will become clear in the next section.

Gradient descent

Given the weights \(\theta^{(k)}\) in the \(k\)th iteration of the weight update, the feature set \(F\), the predicted value \(\hat{y}\), supplied value \(y\), and the objective function \(f(\hat{y}, y, \theta, F)\), what we want to determine for a weight update is the summand \(\Delta\theta\) such that the weights \(\theta^{(k+1)}\) for the \((k+1)\)th iteration can be updated as follows:

\[ \begin{align} \theta^{(k+1)} = \theta^{(k)} + \Delta\theta \label{eqn:partial weight update rule} \end{align} \]

To determine \(\Delta\theta\), we need to know how \(f(\hat{y}, y, \theta, F)\) behaves as we change \(\theta\). Why? Because \(y\) and \(F\) are given and therefore constants for the sample under consideration in the \(k\)th iteration; on the other hand, \(\hat{y}\) depends on \(\theta\). Hence, knowing the behaviour of \(f\) with respect to \(\theta\) completely determines the behaviour of \(f\) for that sample. Once we know whether \(f\) increases or decreases as we change the weights in \(\theta\), and by how much, we can determine \(\Delta\theta\) by changing the values in the direction that maximises (or, minimises) \(f\). For a multivariable function \(f\), such a quantity is defined as the gradient of \(f\) with respect to \(\theta\), which is denoted as \(\nabla_{\theta}f\).

The gradient gives the direction of maximum ascent, which give the changes in \(\theta\) that will increases the value of \(f\) the fastest. Thus, \(\Delta\theta\propto\nabla_{\theta}f\). Since maximisation of an objective function can be reformulated as a minimisation (simply minimise the negative of the objective function) it is normal to pose the weight update problem as a minimisation problem. The objective function thus defined is generally referred to as the loss function. Hence, we choose weights that reduces the value of the loss function. Given this condition, \(\Delta\theta\propto-\nabla_{\theta}f\), which means that we move opposite the direction of maximum ascent. This is generally referred to as the gradient descent approach.

To complete \(\eqref{eqn:partial weight update rule}\), we need one final decision. We know in which direction we must change the weights to reduce the loss function, however, by how much? In machine learning this factor is referred to as the learning rate, and is denoted by \(\eta\). Hence, \(\eqref{eqn:partial weight update rule}\) becomes:

\[ \theta^{(k+1)} = \theta^{(k)} - \eta\nabla_{\theta}f \label{eqn:complete weight update rule}\textrm{ where, }\eta \in (0,1]. \]

Factors such as \(\eta\) and batch size (as used in mini-batch weight update) are generally referred to as hyper-parameters of the training setup. One of the key differences between various methods used to accelerate convergence will vary in the manner \(\eta\) changes during the training, which we shall see in the following sections.

Finding the direction of steepest ascent

What is the direction of steepest ascent for a differentiable multivariable function? To determine this, we will first show that the direction of steepest ascent at any given point, where the function is defined and differentiable, depends only on the gradient of the function at that point, and the angle the chosen direction makes to this gradient. We will then show that the maximum rate of change one can achieve in any possible direction is in fact the magnitude of the gradient, and that the direction in which this maximum is found is in the direction of the gradient.

Let \(f:\mathbb R^n\to \mathbb R\) be a function in \(n\) variables \(x_i\), where \(i = 1, \ldots, n\). The rate of change of \(f\) with respect to \(x_i\) is given by \({\partial f}/{\partial x_i}\), which is the partial derivative of \(f\) with respect to \(x_i\). By definition, when determining \({\partial f}/{\partial x_i}\) we are only varying \(x_i\) while keeping all \(x_j\) fixed, where \(j = 1, \ldots, n\) and \(j \ne i\). If we are given a vector \(\vec{a}\) such that \(x_i = a_i\) are its components, we can use \({\partial f(\vec{a})}/{\partial x_i}\) to determine the rate of change of \(f\) at \(\vec{a}\) with respect to \(x_i\).

Given a vector \(\vec{a}\), our objective is to determine the direction of greatest ascent at \(\vec{a}\). In other words, we wish to determine the direction that maximises the rate of change of \(f\) considering all possible directions from \(\vec{a}\). We cannot simply use the partial derivatives, as we will be varying all of the \(x_i\) components simultaneously when moving along the direction of a vector \(\vec{u}\). We, therefore, need to determine the rate of change of \(f\) with respect to a vector \(\vec{u}\). This is referred to as the directional derivative of \(f\) with respected to a vector \(\vec{u}\), and is denoted as \(D_{\vec{u}}f(x_1,\ldots,x_n)\). Since we are only interested in the direction, we will choose \(\vec{u}\) to be a unit vector so that \(\left\lVert \vec{u}\right\rVert = 1\).

If we denote with \(u_i\) the components of \(\vec{u}\), where \(i = 1, \ldots, n\), then by the definition of a derivative, we have:

\[ \begin{align} D_{\vec{u}}f(x_1,\ldots,x_n) = \lim_{h\to 0}\frac{f(x_1+u_1h,\ldots,x_n+u_nh)-f(x_1,\ldots,x_n)}{h}. \label{eqn:directional derivative definition} \end{align} \]

\[ \begin{align} g(z) = f(a_1+b_1z,\ldots,a_n+b_nz) \label{eqn:g(z) definition} \end{align} \]

for some constants \(a_i\) and \(b_i\), where \(i = 1, \ldots, n\). Again, by the definition of the derivative, we have the derivative of \(g\) with respect to \(z\) as

\[ \begin{align} g'(z) = \lim_{h\to 0}\frac{f(a_1+b_1(z+h),\ldots,a_n+b_n(z+h))-f(a_1+b_1z,\ldots,a_n+b_nz)}{h} \end{align} \]

\[ \begin{align} g'(0) = \lim_{h\to 0}\frac{f(a_1+b_1h,\ldots,a_n+b_nh)-f(a_1,\ldots,a_n)}{h}. \label{eqn:g'(0) definition with limits} \end{align} \]

Hence, from \(\eqref{eqn:directional derivative definition}\) and \(\eqref{eqn:g'(0) definition with limits}\), we have

\[ \begin{align} D_{\vec{u}}f(x_1,\ldots,x_n) = g'(0),\textrm{ where }x_i=a_i\textrm{ and }u_i=b_i. \label{eqn:directional derivative g'(0)} \end{align} \]

Now, in \(\eqref{eqn:g(z) definition}\), let \(y_i = a_i+b_iz\) so that \[\begin{aligned} g(z) = f(y_1,\ldots,y_n).\end{aligned}\]

\[ \begin{align} g'(z) &= \sum_{i=1}^n \frac{\partial f}{\partial y_i}(y_1,\ldots,y_n)\cdot\frac{dy_i}{dz}\\ &= \sum_{i=1}^n \frac{\partial f}{\partial y_i}(y_1,\ldots,y_n)\cdot b_i. \end{align} \]

When \(z=0\), the definition of \(y_i\) above means that \(y_i=a_i\); however, from \(\eqref{eqn:directional derivative g'(0)}\), we also have \(x_i=a_i\) and \(u_i=b_i\). Thus,

\[ \begin{align} g'(0) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(x_1,\ldots,x_n)\cdot u_i. \label{eqn:g'(0) partial derivative} \end{align} \]

Combining equations \(\eqref{eqn:directional derivative g'(0)}\) and \(\eqref{eqn:g'(0) partial derivative}\), we therefore have

\[ \begin{align} D_{\vec{u}}f(x_1,\ldots,x_n) = \sum_{i=1}^n \frac{\partial f}{\partial x_i}(x_1,\ldots,x_n)\cdot u_i. \label{eqn:derivational as partial derivatives} \end{align} \]

\[ \begin{align} \nabla f = \begin{bmatrix}\dfrac{\partial f}{\partial x_1}(x_1,\ldots,x_n)\\\vdots\\\dfrac{\partial f}{\partial x_n}(x_1,\ldots,x_n)\end{bmatrix} \end{align} \]

\[ \begin{align} D_{\vec{u}}f(x_1,\ldots,x_n) = \nabla f\cdot\vec{u}. \label{eqn:derivational as dot product} \end{align} \]

Thus, for any unit vector \(\vec{u}\) the rate of change of \(f\) at \(\vec{a}\) in the direction of \(\vec{u}\) is given by the dot product of \(\vec{u}\) and \(\nabla f\) at \(\vec{a}\). To get the maximum rate of change, we must therefore find the maximum value of \(\nabla f\cdot\vec{u}\).

From the definition of a dot product, we have \(\nabla f\cdot\vec{u} = \left\lVert \nabla f\right\rVert\left\lVert \vec{u}\right\rVert\cos \theta\), where \(\theta\) is the angle between \(\nabla f\) and \(\vec{u}\). Since \(\vec{u}\) is a unit vector, \(\left\lVert \vec{u}\right\rVert = 1\), the maximum rate of change will happen when \(\left\lVert \nabla f\right\rVert\cos \theta\) is maximum. This means that the maximum rate of change depends only on the gradient of \(f\) and the angle between the gradient and \(\vec{u}\). Since the gradient is independent of \(\vec{u}\), to find the direction of maximum rate of change we only need to find the angle \(\theta\) that maximises \(\cos \theta\). We known that \(\cos \theta\) lies in the interval \([-1, 1]\). Hence, the maximum rate of change happens when \(\cos \theta = 1\). This means that the direction of maximum rate of change is the one where \(\theta = 0\). In other words, the maximum rate of change \(\left\lVert \nabla f\right\rVert\), and therefore the steepest ascent, is achieved when we move in the direction of the gradient.

Augustin-Louis Cauchy. “Œuvres complètes”. In: vol. 10. 1. Paris: Gauthier-Villars, 1847. Chap. Méthode générale pour la résolution des systèmes d’équations simultanées, pp. 536–538.↩︎
See translation: Claude Lemaréchal. “Cauchy and the Gradient Method”. In: Documenta Mathe- matica (2012), pp. 251–254..↩︎
Jacques Salomon Hadamard. Lectures on Cauchy’s Problem in Linear Partial Differential Equa- tions. Yale University Press, 1923, p. 49.↩︎