Why are p-values distributed uniformly under the null hypothesis?

Gagarine Yaikhom

14 April 2016, Thursday

In this note, we show that \(p\)-values are uniformly distributed under the null hypothesis, provided that the test statistic has a continuous distribution. We shall do this in three parts. First, we specify the properties of a continuous uniform distribution. We then define the \(p\)-value as a random variable that corresponds to a test statistic under a null hypothesis. Finally, we use the probability integral transform to show that this random variable has a continuous uniform distribution.

If \(X\) is a random variable with a continuous uniform distribution with support defined by parameters \(a\) and \(b\) (the minimum and maximum values of \(X\)), then its cumulative distribution function (CDF) is given by:

\[ F_X(x) = \begin{cases} 0 & \textrm{for } x < a\\ \dfrac{x-a}{b-a} & \textrm{for } a \le x < b\\ 1 & \textrm{for } x \ge b \end{cases} \]

where \(x\) is a realisation of \(X\). Setting \(a = 0\) and \(b = 1\), the CDF for a random variable \(X\) with a continuous uniform distribution with support \([0,1]\) is given by \(F_X(x) = x\) for \(0 \le x \le 1\). This is the first part.

To state the second part, let \(T\) be a random variable that represents all of the possible values of a test statistic under the null hypothesis \(H_0\). If we define the \(p\)-values that correspond to \(T\) as the random variable \(P\), then a realisation \(p\) of \(P\) for some observed test statistic \(t\) is given by:

\[ \begin{align*} p &= \mathbb{P}(T \ge t | H_0)\\ &= 1 - \mathbb{P}(T < t | H_0) \end{align*} \]

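Note here that, if \(T\) has a continuous distribution, \(\mathbb{P}(T < t | H_0) = \mathbb{P}(T \le t | H_0)\), which is the value of the CDF of \(T\) at \(t\) under the null hypothesis \(H_0\). Since this holds for every realisation \(t\) of \(T\), we can express the random variable \(P\) as follows: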

\[P = 1 - F_T(T)\]

where \(F_T\) denotes the CDF of \(T\) under \(H_0\), evaluated here at the random variable \(T\) itself. This completes the second part.
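As a minimal sketch of the second part, the snippet below computes a realisation \(p = 1 - F_T(t)\) for an observed test statistic. It assumes, purely for illustration, that \(T\) is standard normal under \(H_0\); the function name `upper_tail_p_value` is hypothetical and not part of the original note.

```python
from statistics import NormalDist

# Assumption for illustration: under H0 the test statistic T is standard normal.
NULL_DIST = NormalDist(mu=0.0, sigma=1.0)

def upper_tail_p_value(t, null_dist=NULL_DIST):
    """Return p = 1 - F_T(t), the upper-tail p-value of an observed statistic t."""
    return 1.0 - null_dist.cdf(t)

p = upper_tail_p_value(1.96)
print(round(p, 4))
```

Any other continuous null distribution could be substituted, as long as it provides a CDF.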

Let us now define a random variable \(Y\), where

\[Y = F_T(T)\]

If \(F_Y\) represents the cumulative distribution function of \(Y\), then for \(0 \le y \le 1\),

\[ \begin{align} F_Y(y) &= \mathbb{P}(Y \le y)\nonumber\\ &= \mathbb{P}(F_T(T) \le y)\nonumber\\ &= \mathbb{P}(T \le F_T^{-1}(y))\\ &= F_T(F_T^{-1}(y))\nonumber\\ &= y\nonumber \end{align} \]

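Here, \(F_T^{-1}\) is the quantile function, that is, the inverse of \(F_T\). Equation (1) holds because \(F_T\) is assumed to be continuous and strictly increasing on the support of \(T\), so applying \(F_T^{-1}\) to both sides of \(F_T(T) \le y\) preserves the inequality. This result is the probability integral transform.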
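The probability integral transform can be checked empirically. The sketch below assumes, for illustration only, that \(T\) follows an exponential distribution with rate 1, transforms simulated values through the CDF, and compares the empirical CDF of \(Y = F_T(T)\) with \(y\) at a few points. The helper names are hypothetical.

```python
import math
import random

def exponential_cdf(t, rate=1.0):
    """CDF of an exponential distribution: F_T(t) = 1 - exp(-rate * t) for t >= 0."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0

def empirical_cdf(values, y):
    """Fraction of values that are less than or equal to y."""
    return sum(v <= y for v in values) / len(values)

rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [rng.expovariate(1.0) for _ in range(100_000)]

# Y = F_T(T): push every simulated statistic through its own CDF.
transformed = [exponential_cdf(t) for t in samples]

# If Y is uniform on [0, 1], the empirical CDF at y should be close to y.
check_points = (0.1, 0.25, 0.5, 0.75, 0.9)
max_gap = max(abs(empirical_cdf(transformed, y) - y) for y in check_points)
print(max_gap)
```

The gap shrinks as the number of samples grows, which is what one would expect if \(F_Y(y) = y\).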

From the first part, we can therefore say that the random variable \(Y\) has a continuous uniform distribution with support \([0,1]\). Finally, since \(P = 1 - Y\) from the second part, we have \(F_P(p) = \mathbb{P}(1 - Y \le p) = \mathbb{P}(Y \ge 1 - p) = 1 - (1 - p) = p\) for \(0 \le p \le 1\). Hence \(P\) also has a continuous uniform distribution with support \([0,1]\). This concludes the proof.
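The result can also be illustrated by simulation. The sketch below assumes a one-sided \(z\)-test of a zero mean with known unit variance, which is not a setting taken from this note; the function names and parameters are hypothetical. It draws many data sets for which \(H_0\) is true and reports the fraction of \(p\)-values at or below several thresholds. Under uniformity, that fraction should be close to the threshold itself.

```python
import math
import random
from statistics import NormalDist, fmean

def simulate_null_p_values(n_tests=20_000, n_obs=25, seed=1):
    """Simulate upper-tail z-test p-values when H0 (mean 0, variance 1) is true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_tests):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        # With known unit variance, z = mean * sqrt(n) is standard normal under H0.
        z = fmean(sample) * math.sqrt(n_obs)
        p_values.append(1.0 - std_normal.cdf(z))  # p = 1 - F_T(t)
    return p_values

def fraction_at_or_below(p_values, alpha):
    """Empirical CDF of the p-values evaluated at alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

p_values = simulate_null_p_values()
for alpha in (0.01, 0.05, 0.5):
    print(alpha, round(fraction_at_or_below(p_values, alpha), 3))
```

This is why a significance threshold of \(\alpha\) yields a false positive rate of about \(\alpha\) when the null hypothesis holds.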

History

I used to work with data on the effects of genes on certain phenotypes. This requires running several null hypothesis tests for each gene-phenotype combination. Because \(p\)-value adjustment using the Bonferroni correction is too conservative, I was interested in understanding the False Discovery Rate (FDR) approach to \(p\)-value adjustment.

I was reading the brilliant paper "Statistical significance for genomewide studies" by John D. Storey and Robert Tibshirani [PNAS, 100(16), August 2003, pp. 9440-9445], and came across the above assertion. Thus began my quest to understand the assertion. The following links helped me understand the proof, although I found each insufficient to completely understand what is going on. I hope this note fills the gap.