Why divide by (n-1) when calculating sample variance

June 15, 2015

For a given population of size \(N\), where \(X_i\) is the measured value for the \(i\)th entity in the population, the mean (\(\mu\)) and variance (\(\sigma^2\)) of the population are defined respectively as follows:

\[\mu = \frac{1}{N}\sum_{i=1}^NX_i\]

\[\sigma^2 = \frac{1}{N}\sum_{i=1}^N(X_i - \mu)^2\]

From this population, if we draw a random sample of size \(n\) with replacement, where \(x_i\) is the \(i\)th value sampled from the population, the mean (\(\bar x\)) and the variance (\(s^2\)) of the sample are defined as follows:

\begin{equation}\label{eqn:sample mean} \bar x = \frac{1}{n}\sum_{i=1}^nx_i\end{equation}

\begin{equation}\label{eqn:sample variance} s^2 = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar x)^2\end{equation}
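
As a concrete illustration, here is a minimal Python sketch of these two definitions (the function names are my own, chosen for this note):

```python
def sample_mean(xs):
    """Sample mean: sum of the sampled values divided by n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sample variance: sum of squared deviations from the sample
    mean, divided by (n - 1) as in the definition above."""
    n = len(xs)
    x_bar = sample_mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

print(sample_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))  # 4.5714...
```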

In this note, we try to understand why we divide by \((n - 1)\) instead of \(n\) when calculating the sample variance in equation \ref{eqn:sample variance}.

The purpose of a sample statistic, like the mean and variance, is to estimate the corresponding population parameter based on a random sample of the population. We require such estimators because when the population size is very large, it is often impractical to record the measured values for every entity in the population. By using a reasonably sized random sample drawn from the population, we can instead use the sample statistic as an estimate of the population parameter. Hence, it is important to choose an estimator that produces a value that is close to the true population parameter.

If a sample statistic systematically overestimates or underestimates the population parameter, the estimator is referred to as a biased estimator; formally, an estimator \(\hat\theta\) of a parameter \(\theta\) is unbiased if and only if \(E[\hat\theta] = \theta\). An overestimation happens when the sample statistic produces, on average, a value that is larger than the true population parameter. Similarly, an underestimation happens when the sample statistic produces, on average, a value that is smaller than the true population parameter.

Let us define \(\sigma_x^2\) as the sample variance calculated with division by \(n\):

\begin{equation}\label{eqn:biased sample variance} \sigma_x^2 = \frac{1}{n}\sum_{i=1}^n(x_i - \bar x)^2\end{equation}

It turns out that \(\sigma_x^2\) systematically underestimates the population variance, and therefore is a biased estimator. On the other hand, the sample variance as defined in equation \ref{eqn:sample variance} is an unbiased estimator. We can check this by running simulations, such as those available at Khan Academy (https://www.khanacademy.org).
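
Alternatively, here is a minimal simulation sketch in Python (standard library only; the population distribution, population size, sample size and trial count are all arbitrary choices for illustration). It repeatedly draws samples with replacement and compares the average of the two statistics against the true population variance:

```python
import random

# A fixed population; its true variance divides by N, per the definition above.
population = [random.gauss(50, 10) for _ in range(10_000)]
N = len(population)
mu = sum(population) / N
sigma_sq = sum((x - mu) ** 2 for x in population) / N

n, trials = 10, 100_000
biased_sum = unbiased_sum = 0.0
for _ in range(trials):
    sample = random.choices(population, k=n)  # sampling with replacement
    x_bar = sum(sample) / n
    ss = sum((x - x_bar) ** 2 for x in sample)
    biased_sum += ss / n          # divide by n
    unbiased_sum += ss / (n - 1)  # divide by (n - 1)

print(f"population variance:              {sigma_sq:.2f}")
print(f"mean estimate, divide by n:       {biased_sum / trials:.2f}")
print(f"mean estimate, divide by (n - 1): {unbiased_sum / trials:.2f}")
```

The average of the divide-by-\(n\) estimates comes out noticeably below the population variance, while the average of the divide-by-\((n-1)\) estimates closely matches it.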

In the following, we derive Bessel's correction, which defines a formal relationship between the biased sample variance \(\sigma_x^2\) shown in equation \ref{eqn:biased sample variance} and the true population variance \(\sigma^2\). I learned this derivation from Wikipedia; however, I have expanded some of the steps in the derivation to make it clearer.

Let us begin. We start with equation \ref{eqn:biased sample variance} and proceed as follows:

\begin{equation}\begin{aligned} \sigma_x^2 & = \frac{1}{n}\sum_{i=1}^n(x_i - \bar x)^2\\ & = \frac{1}{n}\sum_{i=1}^n(x_i - \frac{1}{n}\sum_{j=1}^nx_j)^2\\ & = \frac{1}{n}\sum_{i=1}^n(x_i^2 - \frac{2}{n}x_i\sum_{j=1}^nx_j + \frac{1}{n^2}\sum_{j=1}^nx_j\sum_{k=1}^nx_k)\end{aligned}\label{eqn:sum terms}\end{equation}

In the first step, we substitute \(\bar x\) using equation \ref{eqn:sample mean}. We then expand using \((a-b)^2 = a^2 - 2ab + b^2\) to give equation \ref{eqn:sum terms}. Note that \((\sum_{j=1}^nx_j)^2\) is rewritten as \(\sum_{j=1}^nx_j\sum_{k=1}^nx_k\); the two are the same expression, written with distinct summation indices so that the product can be expanded term by term.

In equation \ref{eqn:sum terms}, \(x_i\) can be treated as a constant inside each of the inner summations. Hence we can expand the second term in equation \ref{eqn:sum terms} as follows:

\begin{equation}\begin{aligned} \frac{2}{n}x_i\sum_{j=1}^nx_j & = \frac{2}{n}\sum_{j=1}^nx_ix_j\nonumber \\ & = \frac{2}{n}(x_ix_1 + x_ix_2 + \cdots + x_ix_i + \cdots + x_ix_n)\nonumber \\ & = \frac{2}{n}\big(x_ix_i + \sum_{\substack{j = 1\\j \neq i}}^nx_ix_j\big)\nonumber \\ & = \frac{2}{n}x_i^2 + \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^nx_ix_j\end{aligned}\label{eqn:second term}\end{equation}

On expanding the third term in equation \ref{eqn:sum terms}, we get:

\begin{equation}\begin{aligned} \frac{1}{n^2}\sum_{j=1}^nx_j\sum_{k=1}^nx_k & = \frac{1}{n^2}(x_1+x_2+\cdots + x_n)(x_1+x_2+\cdots + x_n)\nonumber \\ & = \frac{1}{n^2}\big[ x_1^2 + x_1x_2 + \cdots + x_1x_n +\nonumber \\ & \qquad x_2x_1 + x_2^2 + \cdots + x_2x_n +\nonumber \\ & \qquad \cdots +\nonumber \\ & \qquad x_nx_1 + x_nx_2 + \cdots + x_n^2 \big]\nonumber \\ & = \frac{1}{n^2}\sum_{j=1}^nx_j^2 + \frac{1}{n^2}\sum_{j=1}^n\sum_{\substack{k = 1\\k \neq j}}^nx_jx_k\end{aligned}\label{eqn:third term}\end{equation}

Note here that the first summation in equation \ref{eqn:third term} corresponds to the diagonal terms of the grid of pairwise products, while the second double summation collects the remaining off-diagonal terms.
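
For concreteness, with \(n = 3\) the pairwise products form the following \(3 \times 3\) grid, whose diagonal holds the \(x_j^2\) terms:

\[\begin{pmatrix} x_1^2 & x_1x_2 & x_1x_3\\ x_2x_1 & x_2^2 & x_2x_3\\ x_3x_1 & x_3x_2 & x_3^2\end{pmatrix}\]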

Substituting equations \ref{eqn:second term} and \ref{eqn:third term} into equation \ref{eqn:sum terms}, we get:

\begin{equation}\begin{aligned} \sigma_x^2 & = \frac{1}{n}\sum_{i=1}^n(x_i^2 - \frac{2}{n}x_i^2 - \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^nx_ix_j + \frac{1}{n^2}\sum_{j=1}^nx_j^2 + \frac{1}{n^2}\sum_{j=1}^n\sum_{\substack{k = 1\\k \neq j}}^nx_jx_k)\end{aligned}\label{eqn:fully expanded}\end{equation}

Note that \(x_i\) is a random variable representing a value sampled from the population. Consequently, \(\sigma_x^2\) is also a random variable, representing the (uncorrected) sample variance of samples drawn from the population. Hence, we can take the expected value of both sides of equation \ref{eqn:fully expanded}.

Now, using the following properties of expected value of a random variable:

  • \(E[c] = c\)

  • \(E[cX] = cE[X]\)

  • \(E[X + c] = E[X] + c\)

  • \(E[X + Y] = E[X] + E[Y]\)

  • \(E[XY] = E[X]E[Y]\)

where \(c\) is a constant, \(X\) and \(Y\) are random variables, and the last property requires \(X\) and \(Y\) to be independent, we get

\begin{equation}\begin{aligned} E[\sigma_x^2] & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}E[x_i^2] - \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^nE[x_ix_j]\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^nE[x_j^2] + \frac{1}{n^2}\sum_{j=1}^n\sum_{\substack{k = 1\\k \neq j}}^nE[x_jx_k]\bigg)\end{aligned}\label{eqn:expected}\end{equation}

Due to the random sampling with replacement, the sampled values are independent and identically distributed random variables; in particular, \(x_i\) and \(x_j\) are independent whenever \(i \neq j\), and likewise \(x_j\) and \(x_k\) whenever \(j \neq k\). Furthermore, the expected value of a random variable representing a value sampled from a population equals the population mean, i.e., \(E[x_i] = \mu\). Hence, we have

\[\begin{aligned} E[x_ix_j] = E[x_i]E[x_j] = \mu^2 \quad\text{and}\quad E[x_jx_k] = E[x_j]E[x_k] = \mu^2\end{aligned}\]

Substituting these values in equation \ref{eqn:expected}, we get:

\[\begin{aligned} E[\sigma_x^2] & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}E[x_i^2] - \frac{2}{n}\sum_{\substack{j = 1\\j \neq i}}^n\mu^2\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^nE[x_j^2] + \frac{n(n-1)}{n^2}\mu^2\bigg)\nonumber\\ & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}E[x_i^2] - \frac{2(n-1)}{n}\mu^2\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^nE[x_j^2] + \frac{n(n-1)}{n^2}\mu^2\bigg)\label{eqn:expected before}\end{aligned}\]

Finally, we use the fact that \(E[x_i^2] = \sigma^2 + \mu^2\), which follows directly from the definition of variance and the expected-value properties listed above:
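
\[\begin{aligned} \sigma^2 = E[(x_i - \mu)^2] = E[x_i^2] - 2\mu E[x_i] + \mu^2 = E[x_i^2] - \mu^2\end{aligned}\]

Rearranging gives \(E[x_i^2] = \sigma^2 + \mu^2\). Substituting this value, we have: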

\[\begin{aligned} E[\sigma_x^2] & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}(\sigma^2 + \mu^2) - \frac{2(n-1)}{n}\mu^2\nonumber \\ & + \frac{1}{n^2}\sum_{j=1}^n(\sigma^2 + \mu^2) + \frac{n(n-1)}{n^2}\mu^2\bigg)\nonumber \\ & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 2}{n}(\sigma^2 + \mu^2) - \frac{2(n-1)}{n}\mu^2\nonumber \\ & + \frac{n}{n^2}(\sigma^2 + \mu^2) + \frac{n(n-1)}{n^2}\mu^2\bigg)\nonumber \\ & = \frac{1}{n}\sum_{i=1}^n\bigg(\frac{n - 1}{n}\sigma^2\bigg)\nonumber \\ & = \frac{n - 1}{n}\sigma^2\end{aligned}\]

Thus, to correct the bias in \(\sigma_x^2\), we must multiply the calculated statistic in equation \ref{eqn:biased sample variance} by \(n/(n-1)\), giving us the unbiased variance estimator in equation \ref{eqn:sample variance}. This is referred to as Bessel's correction. For example, with a sample size of \(n = 10\), the uncorrected statistic underestimates the population variance by 10% on average.
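
In practice, most statistics libraries apply Bessel's correction for you. For instance, in Python, statistics.variance divides by \(n - 1\), while NumPy's var defaults to dividing by \(n\) and exposes the correction through its ddof ("delta degrees of freedom") parameter:

```python
import statistics
import numpy as np

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(statistics.variance(sample))  # divides by n - 1       -> 4.5714...
print(np.var(sample))               # default ddof=0: divides by n -> 4.0
print(np.var(sample, ddof=1))       # ddof=1: divides by n - 1     -> 4.5714...
```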
