Understanding the method of least squares

Gagarine Yaikhom

19 January 2021, Tuesday

Imagine we are given two sets of \(N\) values, \(x_i\) and \(y_i\), where \(1 \le i \le N\). If our objective is to find a function such that:

\[ \begin{equation} \label{eqn:y function of x} y_i = f(x_i), \end{equation} \]

how would we go about finding such a function?

In the above formulation, the value of \(y\) depends on the value of \(x\); however, the value of \(x\) does not depend on the value of \(y\). The variable \(x\) is therefore referred to as the independent variable and the variable \(y\), the dependent variable.

As we can see, we can define an infinite number of functions that satisfy the above equation. However, our objective is to find one that is simple to compute, and fitting a line through the dataset is the simplest choice. In other words, we can reformulate \(\eqref{eqn:y function of x}\) as follows:

\[ \begin{equation} \label{eqn:linear formulation} y_i = \beta_0 + \beta_1x_i + \epsilon_i. \end{equation} \]

Here, \(\beta_0\) and \(\beta_1\) are the parameters of the linear function, also referred to as the linear coefficients, whereas \(\epsilon_i\) is an error term for points that do not lie on the line. As we know, \(\beta_0\) is the intercept of the line on the \(y\)-axis, and \(\beta_1\) is the slope of the line.

Now, even though the formulation in \(\eqref{eqn:linear formulation}\) has reduced the number of functions to look for, we still need to determine the coefficients \(\beta_0\) and \(\beta_1\) that provide the optimal fit. These coefficients are referred to as the parameters of the model as they are fixed for all values of \(x\) and \(y\). Furthermore, the aim of developing a model is to predict values of \(y\) given new values of \(x\).

Let us define the predicted \(y\) values as \(\hat{y}\). Then, given a model with parameters \(\beta_0\) and \(\beta_1\), we will have for a value \(x\):

\[ \begin{equation} \label{eqn:predicted y} \hat{y} = \beta_0 + \beta_1 x. \end{equation} \]

You may be wondering why there is an error term in \(\eqref{eqn:linear formulation}\), but no error term in \(\eqref{eqn:predicted y}\). This is because \(\eqref{eqn:linear formulation}\) defines a relationship between known values of \(x\) and \(y\); hence, if the model fails to produce the correct value, there will be an error component. This means that, depending on the chosen values of the model parameters, we can calculate the error term as follows:

\[ \begin{align} \epsilon_i &= y_i - \hat{y_i}\\ &= y_i - (\beta_0 + \beta_1 x_i). \end{align} \]
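
To make this concrete, here is a minimal Python sketch; the dataset and the values chosen for \(\beta_0\) and \(\beta_1\) are made up purely for illustration:

import numpy as np

# Made-up dataset and arbitrarily chosen parameter values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
beta0, beta1 = 0.5, 1.8

y_hat = beta0 + beta1 * x    # predicted values
errors = y - y_hat           # error terms for the chosen parameters

print(errors)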

In \(\eqref{eqn:predicted y}\), we only know the \(x\) value, and since the corresponding \(y\) value is unknown, we cannot calculate the error term. In fact, the whole point of this note is to devise a mechanism for determining the model parameters such that these error terms are minimised, as that’s the best we can actually do.

Given two sets of values for \(\beta_0\) and \(\beta_1\), which values do you choose? Since our objective is to find models that provide the best prediction of \(y\) given \(x\), we want a model that makes the fewest mistakes, or errors, in the predicted values. Now, if we were given only a single pair of \(x\) and \(y\) values, we could define an infinite number of models that pass through that point \((x, y)\). As we increase the number of points available, our model becomes more constrained. For instance, if we are given only two points \((x_1, y_1)\) and \((x_2, y_2)\), we can define \(\beta_0\) and \(\beta_1\) as follows to get a model with no errors:

\[ \begin{align} \beta_1 &= \frac{y_2 - y_1}{x_2 - x_1},\textrm{ and}\\ \beta_0 &= y_1 - \beta_1x_1. \end{align} \]
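
As a quick illustration, here is a Python sketch that fits a line exactly through two hypothetical points using these formulas:

# Two hypothetical points, for illustration only.
x1, y1 = 1.0, 2.0
x2, y2 = 3.0, 8.0

beta1 = (y2 - y1) / (x2 - x1)    # slope
beta0 = y1 - beta1 * x1          # intercept

# The fitted line passes exactly through both points, so both errors are zero.
print(y1 - (beta0 + beta1 * x1), y2 - (beta0 + beta1 * x2))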

However, once we have more than two points that are not collinear, we will start getting error terms between the model fit and the supplied points. So, how do we find the best fit, the one that minimises the error terms across all of the supplied points?

For a given model, why don’t we add up all of the errors? That way, we can find the best model by finding the values of \(\beta_0\) and \(\beta_1\) which minimise this total error. In other words, instead of optimising \(\beta_0\) and \(\beta_1\) for each point, which may not work in general if the points are not collinear—like fitting a poorly sized carpet on a floor—we can try to optimise \(\beta_0\) and \(\beta_1\) for all points simultaneously by reaching a suitable compromise. Unfortunately, simply adding the error terms to define an overall error will not work, since the error terms could be both positive and negative; under direct summation they might cancel out, making the sum of the error terms unfit for the minimisation task.

Sum of absolute deviations

One way of making the above suggestion work is to consider the deviations from the expected values. In other words, we do not just add the error terms; instead, we add their absolute values, as follows:

\[ \begin{align} \gamma &= \sum_{i=1}^{N}\lvert y_i - \hat{y_i}\rvert\nonumber\\ &= \sum_{i=1}^{N}\lvert y_i - \beta_0 - \beta_1x_i\rvert.\label{eqn:sum of absolute diff} \end{align} \]

In fact, this formulation is called the sum of absolute deviations. We can therefore find the values of \(\beta_0\) and \(\beta_1\) such that \(\gamma\) is minimised. To do this, we differentiate \(\gamma\) with respect to \(\beta_0\) and \(\beta_1\), and choose the values of \(\beta_0\) and \(\beta_1\) where these derivatives are zero:

\[ \begin{align} \frac{\partial\gamma}{\partial \beta_0} &= 0,\textrm{ and}\nonumber\\ \frac{\partial\gamma}{\partial \beta_1} &= 0.\label{eqn:derivatives wrt betas} \end{align} \]

We know that for a function \(f(u) = \lvert u\rvert\), we have \(f(u) = \sqrt{u^2}\). Hence, the chain rule gives us:

\[ \begin{align} \frac{d}{dx}f(u) &= \frac{1}{2\sqrt{u^2}}\cdot 2u\cdot\frac{du}{dx}\\ &= \frac{u}{\lvert u\rvert}\cdot\frac{du}{dx}. \end{align} \]
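
As a quick check, this derivative can be verified symbolically; the Python sketch below uses sympy and assumes \(u\) is a real-valued symbol (note that \(\textrm{sign}(u)\) equals \(u/\lvert u\rvert\) wherever \(u \ne 0\)):

import sympy as sp

u = sp.Symbol('u', real=True)
# Differentiating |u| symbolically gives sign(u), i.e. u/|u| wherever u is non-zero.
print(sp.diff(sp.Abs(u), u))    # prints: sign(u)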

Applying this to \(\eqref{eqn:derivatives wrt betas}\), we get

\[ \begin{align} \frac{\partial\gamma}{\partial \beta_0} &= \sum_{i=1}^{N}\frac{y_i-\beta_0-\beta_1x_i}{\lvert y_i - \beta_0 - \beta_1x_i \rvert}\cdot(-1) = 0 \end{align} \]

and

\[ \begin{align} \frac{\partial\gamma}{\partial \beta_1} &= \sum_{i=1}^{N}\frac{y_i-\beta_0 - \beta_1x_i}{\lvert y_i - \beta_0 - \beta_1 x_i\rvert}\cdot(-x_i) = 0. \end{align} \]

It is difficult to solve these equations analytically to determine the values of \(\beta_0\) and \(\beta_1\) that satisfy them, not least because the absolute value is not differentiable at zero. Hence, we can use numerical methods to solve the system of equations; for instance, the Barrodale-Roberts algorithm uses the simplex method.
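
The Barrodale-Roberts algorithm itself is beyond the scope of this note, but as a rough sketch of the numerical route, a general-purpose derivative-free optimiser can minimise \(\gamma\) directly. The Python example below uses scipy's Nelder-Mead method on a hypothetical dataset; it is only an illustration, not the Barrodale-Roberts algorithm:

import numpy as np
from scipy.optimize import minimize

# Hypothetical dataset, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])

def gamma(params):
    # Sum of absolute deviations for candidate parameters (beta0, beta1).
    beta0, beta1 = params
    return np.sum(np.abs(y - beta0 - beta1 * x))

# Nelder-Mead is derivative-free, which suits this non-smooth objective.
result = minimize(gamma, x0=[0.0, 0.0], method='Nelder-Mead')
beta0, beta1 = result.x
print(beta0, beta1)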

There is another method we can use to determine the best fit. This is referred to as the method of least squares.

The method of least squares

In the least squares method, the loss function is defined as follows:

\[ \begin{equation} J(f(x_i);x_i) = \sum_{i=1}^{N}(y_i-\hat{y_i})^2. \end{equation} \]

Here, \(J\) is a function of the values of the independent variable \(x\) and of the \(y\) values predicted by our model \(f(x_i)\); through \(f\), it depends on the model parameters. To find the best fit, therefore, we wish to minimise the function \(J\) with respect to the model parameters \(\beta_0\) and \(\beta_1\). In other words, we need to solve the following two simultaneous equations:

\[ \begin{align} \frac{\partial}{\partial\beta_0}J(f(x_i);x_i) &= 0,\textrm{ and}\nonumber\\ \frac{\partial}{\partial\beta_1}J(f(x_i);x_i) &= 0.\nonumber \end{align} \]

Or, we need to solve:

\[ \begin{align} \sum_{i=1}^{N}\frac{\partial}{\partial\beta_0} (y_i - \beta_0 - \beta_1x_i)^2 &= 0,\textrm{ and}\nonumber\\ \sum_{i=1}^{N}\frac{\partial}{\partial\beta_1} (y_i - \beta_0 - \beta_1x_i)^2 &= 0.\nonumber \end{align} \]

Once more, using the chain rule, this becomes:

\[ \begin{align} \sum_{i=1}^{N}2(y_i - \beta_0 - \beta_1x_i)(-1) &= 0,\textrm{ and}\nonumber\\ \sum_{i=1}^{N}2(y_i - \beta_0 - \beta_1x_i)(-x_i) &= 0.\nonumber \end{align} \]

Or,

\[ \begin{align} \sum_{i=1}^{N}y_i - \sum_{i=1}^{N}\beta_0 - \sum_{i=1}^{N}\beta_1x_i &= 0,\textrm{ and}\label{eqn:beta 0 linear equation}\\ \sum_{i=1}^{N}x_iy_i - \sum_{i=1}^{N}\beta_0x_i - \sum_{i=1}^{N}\beta_1x_i^2 &= 0.\label{eqn:beta 1 linear equation} \end{align} \]

Hence, solving for \(\beta_0\) in \(\eqref{eqn:beta 0 linear equation}\), we get

\[ \begin{equation} \beta_0 = \frac{\sum y-\beta_1\sum x}{N}\label{eqn:beta 0 solved with beta 1} \end{equation} \]

To simplify the equations, we have removed the summation limits so that by \(\sum\) we mean \(\sum_{i=1}^{N}\). Now, substituting this expression for \(\beta_0\) into \(\eqref{eqn:beta 1 linear equation}\), we get

\[ \sum xy-\sum\left[ \frac{\sum y-\beta_1\sum x}{N} \right]x - \beta_1\sum x^2=0. \]

Or,

\[ \begin{equation} \beta_1 = \frac{N\sum xy-\sum x\sum y}{N\sum x^2 - \left(\sum x\right)^2}.\label{eqn:beta 1 solved} \end{equation} \]

Substituting \(\beta_1\) back into \(\eqref{eqn:beta 0 solved with beta 1}\), we get:

\[ \begin{equation} \beta_0 = \frac{\sum x^2\sum y - \sum x\sum xy}{N\sum x^2 - \left(\sum x\right)^2}.\label{eqn:beta 0 solved} \end{equation} \]
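
As a quick sanity check of these closed-form expressions, here is a minimal Python sketch that evaluates them on a hypothetical dataset and cross-checks the result against numpy.polyfit, which fits the same least-squares line:

import numpy as np

# Hypothetical dataset, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3])

N = len(x)
sum_x, sum_y = x.sum(), y.sum()
sum_xy, sum_x2 = (x * y).sum(), (x * x).sum()

denominator = N * sum_x2 - sum_x ** 2
beta1 = (N * sum_xy - sum_x * sum_y) / denominator
beta0 = (sum_x2 * sum_y - sum_x * sum_xy) / denominator

print(beta0, beta1)
print(np.polyfit(x, y, 1))    # returns (slope, intercept), i.e. (beta1, beta0)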

Pending work

I still haven’t quite figured out how statistical significance testing is carried out on these estimated parameters, and how a confidence interval may be defined for the parameters. Furthermore, it will be nice to implement this fully in code.

Will update this article once I do.