15 June 2015, Monday
For a given population of size \(N\), where \(X_i\) is the measured value for the \(i\)th entity in the population, the mean (\(\mu\)) and variance (\(\sigma^2\)) of the population are defined respectively as follows:
\[ \begin{align} \mu &= \frac{1}{N}\sum_{i=1}^NX_i\\ \sigma^2 &= \frac{1}{N}\sum_{i=1}^N(X_i - \mu)^2 \end{align} \]
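As a concrete illustration, here is a minimal Python sketch of these two definitions (the population values are a made-up example, not from any dataset):

```python
# A small hypothetical population of N = 5 measured values.
population = [2.0, 4.0, 4.0, 4.0, 6.0]
N = len(population)

mu = sum(population) / N                               # population mean
sigma_sq = sum((X - mu) ** 2 for X in population) / N  # population variance

print(mu, sigma_sq)  # 4.0 1.6
```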
From this population, if we draw a random sample of size \(n\) with replacement, where \(x_i\) is the \(i\)th measured value sampled from the population, the mean (\(\bar x\)) and the variance (\(s^2\)) of the sample are defined as follows:
\[ \begin{equation} \label{eqn:sample mean} \bar x = \frac{1}{n}\sum_{i=1}^nx_i \end{equation} \]
\[ \begin{equation} \label{eqn:sample variance} s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2 \end{equation} \]
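A similar sketch for the two sample statistics; note that Python's standard `statistics` module implements exactly these formulas, with `statistics.variance` dividing by \(n - 1\) (the sample values below are again made up):

```python
import statistics

sample = [2.0, 4.0, 6.0]  # a hypothetical sample of size n = 3
n = len(sample)

x_bar = sum(sample) / n                                 # sample mean
s_sq = sum((x - x_bar) ** 2 for x in sample) / (n - 1)  # sample variance, n - 1 divisor

# The standard library agrees with the manual formulas.
assert x_bar == statistics.mean(sample)
assert abs(s_sq - statistics.variance(sample)) < 1e-12
```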
In this note, we try to understand why we divide by \((n - 1)\) instead of \(n\) when calculating the sample variance in \(\eqref{eqn:sample variance}\).
The purpose of a sample statistic, such as the sample mean or sample variance, is to estimate the corresponding population parameter from a random sample of the population. We need such estimators because, when the population is very large, it is often impractical to record the measured value of every entity in it. Instead, we draw a reasonably sized random sample and use the sample statistic as an estimate of the population parameter. Hence, it is important to choose an estimator whose values are close to the true population parameter.
If a sample statistic systematically overestimates or underestimates the population parameter, the estimator is referred to as a biased estimator. An overestimation happens when the expected value of the sample statistic is larger than the true population parameter. Similarly, an underestimation happens when the expected value of the sample statistic is smaller than the true population parameter.
Let us define \(\sigma_x^2\) as the sample variance calculated with division by \(n\):
\[ \begin{equation} \label{eqn:biased sample variance} \sigma_x^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar x)^2 \end{equation} \]
It turns out that \(\sigma_x^2\) underestimates the population variance on average, and therefore is a biased estimator. On the other hand, the sample variance as defined in \(\eqref{eqn:sample variance}\) is an unbiased estimator. We can check this by running simulations, such as the interactive ones available at Khan Academy.
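If you prefer to run a simulation locally, here is a small Monte Carlo sketch in Python that makes the same point; the population, sample size and trial count are arbitrary choices of mine:

```python
import random

# A fixed population with a known variance.
population = list(range(100))
N = len(population)
mu = sum(population) / N
sigma_sq = sum((X - mu) ** 2 for X in population) / N

n, trials = 5, 100_000
biased_total = unbiased_total = 0.0
for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]  # sampling with replacement
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    biased_total += ss / n          # divide by n
    unbiased_total += ss / (n - 1)  # divide by n - 1

print("population variance:", sigma_sq)                             # 833.25
print("mean of n-divisor estimates:", biased_total / trials)        # approx. (n-1)/n * 833.25 = 666.6
print("mean of (n-1)-divisor estimates:", unbiased_total / trials)  # approx. 833.25
```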
In the following, we derive Bessel’s correction, which defines a formal relationship between the biased sample variance \(\sigma_x^2\) shown in \(\eqref{eqn:biased sample variance}\) and the true population variance \(\sigma^2\). I learned this derivation from Wikipedia; however, I have expanded some of the steps in the derivation to make it clearer.
Let us begin. We start with \(\eqref{eqn:biased sample variance}\) and proceed as follows:
\[ \begin{equation} \begin{aligned} \sigma_x^2 & = \frac{1}{n}\sum_{i=1}^n(x_i - \bar x)^2\\ & = \frac{1}{n}\sum_{i=1}^n(x_i - \frac{1}{n}\sum_{j=1}^nx_j)^2\\ & = \frac{1}{n}\sum_{i=1}^n(x_i^2 - \frac{2}{n}x_i\sum_{j=1}^nx_j + \frac{1}{n^2}\sum_{j=1}^nx_j\sum_{k=1}^nx_k) \end{aligned} \label{eqn:sum terms} \end{equation} \]
In the first step, we substitute \(\bar x\) using \(\eqref{eqn:sample mean}\). We then expand using \((a-b)^2 = a^2 -2ab + b^2\) to give \(\eqref{eqn:sum terms}\). Note that \((\sum_{j=1}^nx_j)^2\) is rewritten as \(\sum_{j=1}^nx_j\sum_{k=1}^nx_k\), which is the same expression written with two independent summation indices.
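This rewriting is easy to sanity-check numerically; a quick Python check with arbitrary values:

```python
xs = [1.0, 2.0, 3.0, 4.0]

square_of_sum = sum(xs) ** 2                         # (sum_j x_j)^2
double_sum = sum(xj * xk for xj in xs for xk in xs)  # sum_j sum_k x_j x_k

assert square_of_sum == double_sum  # both are 100.0
```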
In \(\eqref{eqn:sum terms}\), \(x_i\) can be treated as a constant inside each of the inner summations. Hence we can expand the second term in \(\eqref{eqn:sum terms}\) as follows:
\[ \begin{equation} \begin{aligned} \frac{2}{n}x_i\sum_{j=1}^nx_j &= \frac{2}{n}\sum_{j=1}^nx_ix_j\nonumber \\ & = \frac{2}{n}(x_ix_1 + x_ix_2 + \cdots + x_ix_i + \cdots + x_ix_n)\nonumber \\ & = \frac{2}{n}(x_ix_i + \sum_{j \neq i}^nx_ix_j)\nonumber \\ & = \frac{2}{n}x_i^2 + \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^nx_ix_j \end{aligned} \label{eqn:second term} \end{equation} \]
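Again, a quick numerical check of this expansion for one fixed index \(i\) (the values and the index are arbitrary):

```python
xs = [1.0, 2.0, 3.0, 4.0]
n = len(xs)
i = 1          # check the expansion for x_i = xs[1]
xi = xs[i]

lhs = (2 / n) * xi * sum(xs)
rhs = (2 / n) * xi ** 2 + (2 / n) * sum(xi * xj for j, xj in enumerate(xs) if j != i)

assert abs(lhs - rhs) < 1e-12
```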
On expanding the third term in \(\eqref{eqn:sum terms}\), we get:
\[ \begin{equation} \begin{aligned} \frac{1}{n^2}\sum_{j=1}^nx_j\sum_{k=1}^nx_k & = \frac{1}{n^2}(x_1+x_2+\cdots + x_n)(x_1+x_2+\cdots + x_n)\nonumber\\ & = \frac{1}{n^2}\big[ x_1^2 + x_1x_2 + \cdots + x_1x_n +\nonumber\\ & \qquad x_2x_1 + x_2^2 + \cdots + x_2x_n +\nonumber\\ & \qquad \vdots \nonumber\\ & \qquad x_nx_1 + x_nx_2 + \cdots + x_n^2 \big]\nonumber\\ & = \frac{1}{n^2}\sum_{j=1}^nx_j^2 + \frac{1}{n^2}\sum_{j=1}^n\sum_{\substack{k = 1\\k \neq j}}^nx_jx_k \end{aligned} \label{eqn:third term} \end{equation} \]
Note here that the first summation in \(\eqref{eqn:third term}\) corresponds to the diagonal terms (\(j = k\)) of the expanded product above. The second double summation corresponds to the remaining off-diagonal terms (\(j \neq k\)).
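The diagonal/off-diagonal split can be checked the same way:

```python
xs = [1.0, 2.0, 3.0, 4.0]
n = len(xs)

full = (1 / n**2) * sum(xj * xk for xj in xs for xk in xs)
diagonal = (1 / n**2) * sum(xj ** 2 for xj in xs)
off_diagonal = (1 / n**2) * sum(
    xs[j] * xs[k] for j in range(n) for k in range(n) if k != j
)

assert abs(full - (diagonal + off_diagonal)) < 1e-12
```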
Substituting equations \(\eqref{eqn:second term}\) and \(\eqref{eqn:third term}\) into \(\eqref{eqn:sum terms}\), we get:
\[ \begin{equation} \begin{aligned} \sigma_x^2 & = \frac{1}{n}\sum_{i=1}^n(x_i^2 - \frac{2}{n}x_i^2 - \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^nx_ix_j + \frac{1}{n^2}\sum_{j=1}^nx_j^2 + \frac{1}{n^2}\sum_{j=1}^n\sum_{\substack{k = 1\\k \neq j}}^nx_jx_k) \end{aligned} \label{eqn:fully expanded} \end{equation} \]
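At this point it is worth confirming numerically that the fully expanded form still equals \(\sigma_x^2\) computed directly from its definition; a quick Python check with arbitrary values:

```python
xs = [1.0, 2.0, 3.0, 5.0]
n = len(xs)
x_bar = sum(xs) / n

# Direct computation from the definition of the biased sample variance.
direct = sum((x - x_bar) ** 2 for x in xs) / n

# Term-by-term evaluation of the fully expanded form.
expanded = (1 / n) * sum(
    xs[i] ** 2
    - (2 / n) * xs[i] ** 2
    - (2 / n) * sum(xs[i] * xs[j] for j in range(n) if j != i)
    + (1 / n**2) * sum(xs[j] ** 2 for j in range(n))
    + (1 / n**2) * sum(xs[j] * xs[k] for j in range(n) for k in range(n) if k != j)
    for i in range(n)
)

assert abs(direct - expanded) < 1e-9
```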
Note that \(x_i\) is a random variable that represents a value sampled from the population. Similarly, \(\sigma_x^2\) is also a random variable that represents the sample variance of samples drawn from the population. Hence, we can take the expected value of both sides of \(\eqref{eqn:fully expanded}\).
Now, using the following properties of expected value of a random variable:
\(E[c] = c\)
\(E[cX] = cE[X]\)
\(E[X + c] = E[X] + c\)
\(E[X + Y] = E[X] + E[Y]\)
\(E[XY] = E[X]E[Y]\)
where \(c\) is a constant and \(X\) and \(Y\) are independent random variables, we get
\[ \begin{equation} \begin{aligned} E[\sigma_x^2] & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}E[x_i^2] - \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^nE[x_ix_j]\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^nE[x_j^2] + \frac{1}{n^2}\sum_{j=1}^n\sum_{\substack{k = 1\\k \neq j}}^nE[x_jx_k]\bigg) \end{aligned} \label{eqn:expected} \end{equation} \]
Due to the random sampling with replacement, \(x_i\), \(x_j\) and \(x_k\) represent independent random variables whenever their indices are distinct. Furthermore, the expected value of a random variable that represents a sample value drawn from the population equals the population mean, i.e., \(E[x_i] = \mu\). Hence, for \(i \neq j\) and \(j \neq k\), we have
\[ \begin{aligned} E[x_ix_j] = E[x_i]E[x_j] = \mu^2 \qquad\text{and}\qquad E[x_jx_k] = E[x_j]E[x_k] = \mu^2 \end{aligned} \]
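A quick simulation illustrates this; for independent draws, the average of the product \(x_ix_j\) converges to \(\mu^2\) (the uniform population below is an arbitrary choice of mine):

```python
import random

population = list(range(10))
mu = sum(population) / len(population)  # 4.5

trials = 200_000
total = 0.0
for _ in range(trials):
    xi = random.choice(population)  # two independent draws with replacement
    xj = random.choice(population)
    total += xi * xj

print(total / trials, mu ** 2)  # both approx. 20.25
```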
Substituting these values in \(\eqref{eqn:expected}\), we get:
\[ \begin{aligned} E[\sigma_x^2] & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}E[x_i^2] - \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^n\mu^2\nonumber\\ & + \frac{1}{n^2}\sum_{j=1}^nE[x_j^2] + \frac{n(n-1)}{n^2}\mu^2\bigg)\nonumber\\ & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}E[x_i^2] - \frac{2(n-1)}{n}\mu^2\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^nE[x_j^2] + \frac{n(n-1)}{n^2}\mu^2\bigg) \label{eqn:expected before} \end{aligned} \]
Finally, we use the identity \(E[x_i^2] = \sigma^2 + \mu^2\).
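This identity is just a rearrangement of the definition of variance. For any random variable \(X\) with mean \(\mu\) and variance \(\sigma^2\):
\[ \begin{aligned} \sigma^2 &= E[(X - \mu)^2]\\ &= E[X^2 - 2\mu X + \mu^2]\\ &= E[X^2] - 2\mu E[X] + \mu^2\\ &= E[X^2] - 2\mu^2 + \mu^2\\ &= E[X^2] - \mu^2 \end{aligned} \]
Rearranging gives \(E[X^2] = \sigma^2 + \mu^2\). Since each \(x_i\) is a value drawn from the population, it has mean \(\mu\) and variance \(\sigma^2\), so \(E[x_i^2] = \sigma^2 + \mu^2\). Substituting this into the previous equation, we have: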
\[ \begin{aligned} E[\sigma_x^2] & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}(\sigma^2 + \mu^2) - \frac{2(n-1)}{n}\mu^2\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^n(\sigma^2 + \mu^2) + \frac{n(n-1)}{n^2}\mu^2\bigg)\nonumber \\ & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}(\sigma^2 + \mu^2) - \frac{2(n-1)}{n}\mu^2\nonumber \\ & + \frac{n}{n^2}(\sigma^2 + \mu^2) + \frac{n(n-1)}{n^2}\mu^2\bigg)\nonumber \\ & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 1}{n}\sigma^2\bigg)\nonumber \\ & = \frac{n - 1}{n}\sigma^2 \end{aligned} \]
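The last simplification step can also be verified symbolically; here is a sketch using SymPy:

```python
from sympy import simplify, symbols

n, mu, sigma = symbols('n mu sigma', positive=True)

# The expression inside the outer summation, which does not depend on i.
inner = (
    (n - 2) / n * (sigma**2 + mu**2)
    - 2 * (n - 1) / n * mu**2
    + n / n**2 * (sigma**2 + mu**2)
    + n * (n - 1) / n**2 * mu**2
)

# The outer (1/n) * sum over i of a term independent of i is the term itself,
# so it suffices to check that `inner` simplifies to (n-1)/n * sigma^2.
assert simplify(inner - (n - 1) / n * sigma**2) == 0
```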
Thus, to correct the bias in \(\sigma_x^2\), we must multiply the statistic calculated in \(\eqref{eqn:biased sample variance}\) by \(n/(n-1)\), giving us the unbiased variance estimator in \(\eqref{eqn:sample variance}\). This is referred to as Bessel’s correction.
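In practice, numerical libraries expose this choice directly. For example, NumPy's `np.var` accepts a `ddof` ("delta degrees of freedom") argument: `ddof=0` gives the biased divide-by-\(n\) estimate in \(\eqref{eqn:biased sample variance}\), while `ddof=1` applies Bessel's correction to give \(\eqref{eqn:sample variance}\):

```python
import numpy as np

sample = np.array([2.0, 4.0, 6.0])

print(np.var(sample, ddof=0))  # 2.666... (divides by n)
print(np.var(sample, ddof=1))  # 4.0      (divides by n - 1, Bessel's correction)
```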