April 14, 2016

In this note, we show that $$p$$-values are uniformly distributed under the null hypothesis when the test statistic has a continuous distribution. We shall do this in three parts. First, we specify the properties of a continuous uniform distribution. We then define the $$p$$-value as a random variable determined by a test statistic under the null hypothesis. Finally, we use the probability integral transform to show that this random variable has a continuous uniform distribution.

If $$X$$ is a random variable with a continuous uniform distribution with support defined by parameters $$a$$ and $$b$$ (the minimum and maximum values of $$X$$), then its cumulative distribution function (CDF) is given by:

$F_X(x) = \begin{cases} 0 &\textrm{for } x < a\\ \frac{x-a}{b-a} &\textrm{for } a \le x < b\\ 1 &\textrm{for } x \ge b\\ \end{cases}$

where $$x$$ is a realisation of $$X$$. Hence, the CDF for a random variable $$X$$ with a continuous uniform distribution with support $$[0,1]$$ is given by $$F_X(x) = x$$ for $$0 \le x \le 1$$. This is the first part.

To state the second part, let $$T$$ be a random variable that represents all of the possible values of a test statistic under the null hypothesis $$H_0$$. If we define the $$p$$-values that correspond to $$T$$ as the random variable $$P$$, then a realisation $$p$$ of $$P$$ for some observed test statistic $$t$$ is given by:

\begin{align*} p &= \mathbb{P}(T \ge t | H_0)\\ &= 1 - \mathbb{P}(T < t | H_0) \end{align*}

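If $$T$$ has a continuous distribution under $$H_0$$, then $$\mathbb{P}(T = t | H_0) = 0$$, so $$\mathbb{P}(T < t | H_0) = \mathbb{P}(T \le t | H_0)$$ is the value of the CDF of $$T$$ at $$t$$ under the null hypothesis $$H_0$$. Since this holds for every realisation $$t$$ of $$T$$, we can write the random variable $$P$$ as follows: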

$P = 1 - F_T(T)$

where $$F_T$$ denotes the CDF of $$T$$ under $$H_0$$, evaluated here at the random variable $$T$$ itself. This completes the second part.
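As a concrete illustration of the second part, the sketch below computes a realisation $$p = 1 - F_T(t)$$ for an observed statistic $$t$$. It assumes, purely for the example, that $$T$$ is standard normal under $$H_0$$ (a one-sided $$z$$-test); the function name `upper_tail_p_value` is hypothetical and not part of the argument above.

```python
from statistics import NormalDist

# Assumption for this example only: under H0 the test statistic T
# follows a standard normal distribution (a one-sided z-test).
NULL_DISTRIBUTION = NormalDist(mu=0.0, sigma=1.0)


def upper_tail_p_value(t, null_dist=NULL_DISTRIBUTION):
    """Return p = P(T >= t | H0) = 1 - F_T(t) for a continuous T."""
    return 1.0 - null_dist.cdf(t)


# An observed statistic at the median of the null distribution.
p_at_median = upper_tail_p_value(0.0)

# An observed statistic near the familiar one-sided 2.5% cut-off.
p_at_1_96 = upper_tail_p_value(1.96)

print(p_at_median, p_at_1_96)
```

Larger observed values of $$t$$ give smaller values of $$p$$, which is the sense in which $$P$$ is a decreasing function of $$T$$.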

Let us now define a random variable $$Y$$, where

$Y = F_T(T)$

If $$F_Y$$ represents the cumulative distribution function of $$Y$$, then for $$0 < y < 1$$

\begin{align} F_Y(y) &= \mathbb{P}(Y \le y)\nonumber\\ &= \mathbb{P}(F_T(T) \le y)\nonumber\\ &= \mathbb{P}(T \le F_T^{-1}(y))\\ &= F_T(F_T^{-1}(y))\nonumber\\ &= y\nonumber \end{align}

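Here, $$F_T^{-1}$$ is the quantile function of $$T$$. Equation (1) holds because $$F_T$$ is non-decreasing, so applying $$F_T^{-1}$$ to both sides of $$F_T(T) \le y$$ preserves the direction of the inequality. The simplest sufficient condition for the remaining steps is that $$F_T$$ is continuous and strictly increasing on the support of $$T$$, so that $$F_T^{-1}$$ is a genuine inverse and $$F_T(F_T^{-1}(y)) = y$$.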
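The derivation above (the probability integral transform) can be checked numerically. The following is a minimal sketch, assuming for illustration that $$T$$ is exponentially distributed with an arbitrarily chosen rate of 1.5; the helper names `exponential_cdf` and `max_ecdf_gap` are hypothetical. It draws values of $$T$$, computes $$Y = F_T(T)$$, and measures the largest gap between the empirical CDF of $$Y$$ and the line $$F_Y(y) = y$$.

```python
import math
import random


def exponential_cdf(t, rate):
    """CDF of an Exponential(rate) random variable."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0


def max_ecdf_gap(values):
    """Largest distance between the empirical CDF of `values`
    and the Uniform(0, 1) CDF F(y) = y (Kolmogorov-Smirnov distance)."""
    ordered = sorted(values)
    n = len(ordered)
    gap = 0.0
    for i, y in enumerate(ordered):
        # The empirical CDF jumps from i/n to (i+1)/n at y.
        gap = max(gap, abs((i + 1) / n - y), abs(i / n - y))
    return gap


rng = random.Random(2016)  # fixed seed so the run is repeatable
rate = 1.5                 # arbitrary choice for the illustration

draws = [rng.expovariate(rate) for _ in range(20000)]
y_values = [exponential_cdf(t, rate) for t in draws]

gap = max_ecdf_gap(y_values)
print(f"largest gap between the ECDF of Y and y: {gap:.4f}")
```

The gap shrinks as the number of draws grows, which is what one would expect if $$F_Y(y) = y$$.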

From the first part, we can therefore say that the random variable $$Y$$ has a continuous uniform distribution with support $$[0,1]$$. Finally, since $$P = 1 - Y$$ from the second part, we have $$\mathbb{P}(P \le p) = \mathbb{P}(Y \ge 1 - p) = 1 - F_Y(1 - p) = p$$ for $$0 \le p \le 1$$, so $$P$$ also has a continuous uniform distribution with support $$[0,1]$$. This concludes the proof.
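The result can also be seen in simulation. The sketch below is an illustration under stated assumptions, not part of the proof: each simulated experiment draws 25 observations from a standard normal distribution (so $$H_0$$, mean zero with known unit variance, is true), forms the $$z$$-statistic $$T = \sqrt{n}\,\bar{x}$$, and computes $$p = 1 - F_T(t)$$. If $$P$$ is uniform on $$[0,1]$$, the fraction of $$p$$-values at or below any threshold $$\alpha$$ should be close to $$\alpha$$. The sample size, number of experiments, and thresholds are arbitrary choices.

```python
import math
import random
from statistics import NormalDist, fmean

rng = random.Random(14)  # fixed seed so the run is repeatable
standard_normal = NormalDist(mu=0.0, sigma=1.0)

n_obs = 25       # observations per simulated experiment
n_tests = 10000  # number of simulated experiments, all with H0 true

p_values = []
for _ in range(n_tests):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    # With known unit variance, T = sqrt(n) * mean is N(0, 1) under H0.
    t = math.sqrt(n_obs) * fmean(sample)
    p_values.append(1.0 - standard_normal.cdf(t))

# For uniform p-values, P(P <= alpha) = alpha at every threshold.
rejection_rates = {
    alpha: sum(p <= alpha for p in p_values) / n_tests
    for alpha in (0.01, 0.05, 0.25, 0.5)
}

for alpha, rate in rejection_rates.items():
    print(f"alpha = {alpha:.2f}  fraction of p <= alpha = {rate:.4f}")
```

This is also why a test that rejects when $$p \le \alpha$$ has a false positive rate of $$\alpha$$ when the null hypothesis is true.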

History

I work with data that are concerned with the effects of genes on certain phenotypes. This requires running several null hypothesis tests for each gene-phenotype combination. Because $$p$$-value adjustment using Bonferroni correction is too conservative, I was interested in understanding the False Discovery Rate (FDR) approach for $$p$$-value adjustment.

I was reading the brilliant paper "Statistical significance for genomewide studies" by John D. Storey and Robert Tibshirani [PNAS, 100(16), August 2003, pp. 9440-9445], and came across the above assertion. Thus began my quest to understand the assertion. The following links helped me understand the proof, although I found each insufficient to completely understand what is going on. I hope this note fills the gap.