P-value

Overview
In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one actually observed, assuming the null hypothesis is true, i.e., that the observed result arose from chance alone. The fact that p-values are computed under this assumption is crucial to their correct interpretation.

Coin flipping example
For example, say an experiment is performed to determine whether a coin flip is fair (50% chance of landing heads or tails) or unfairly biased, either toward heads (> 50% chance of landing heads) or toward tails (< 50% chance of landing heads). Since both biased alternatives are considered, a two-tailed test is performed. The null hypothesis is that the coin is fair, and that any deviations from the 50% rate can be ascribed to chance alone. Suppose the experimental results show the coin turning up heads 14 times out of 20 total flips. The p-value of this result is the chance of a fair coin landing on heads at least 14 times out of 20 flips (since larger values are even less favorable to the null hypothesis of a fair coin) or landing on tails at most 6 times out of 20 flips. Here the random variable T, the number of heads in 20 flips, has a binomial distribution. The probability that 20 flips of a fair coin would result in 14 or more heads is 0.0577. Since this is a two-tailed test, the probability that 20 flips would result in 14 or more heads or 6 or fewer heads is 0.0577 × 2 ≈ 0.115.
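The calculation above can be sketched in a few lines of Python using only the standard library (the function name `binom_tail` is illustrative, not from any particular package):

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(T >= k) for T ~ Binomial(n, p): sum the upper tail of the pmf."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of 14 or more heads in 20 flips of a fair coin
one_tailed = binom_tail(20, 14)

# By the symmetry of the fair-coin distribution, "6 or fewer heads"
# has the same probability, so the two-tailed p-value doubles the tail.
two_tailed = 2 * one_tailed

print(f"one-tailed: {one_tailed:.4f}, two-tailed: {two_tailed:.4f}")
```

Doubling the one-tailed probability is valid here because the binomial distribution with p = 0.5 is symmetric; for asymmetric nulls the two tails must be computed separately.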

Generally, the smaller the p-value, the stronger the evidence the data provide against the null hypothesis of a fair coin.

Interpretation
Generally, one rejects the null hypothesis if the p-value is smaller than or equal to the significance level, often represented by the Greek letter α (alpha). If the significance level is 0.05, then results at least as extreme as those observed would occur only 5% of the time, given that the null hypothesis is true.

In the above example, the calculated p-value exceeds 0.05, and thus the null hypothesis (that the observed result of 14 heads out of 20 flips can be ascribed to chance alone) is not rejected. Such a finding is often stated as being "not statistically significant at the 5% level".

However, had a single extra head been obtained, the resulting two-tailed p-value would be about 0.041. This time the null hypothesis (that the observed result of 15 heads out of 20 flips can be ascribed to chance alone) is rejected. Such a finding would be described as being "statistically significant at the 5% level".
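The decision rule described above (reject when p ≤ α) can be illustrated for both outcomes with a short Python sketch; the helper `two_tailed_p` is illustrative and relies on the symmetry of the fair-coin case:

```python
from math import comb

def two_tailed_p(n, k, p=0.5):
    """Two-tailed binomial p-value for k heads in n flips, doubling the
    upper tail (valid for the symmetric fair-coin null, k >= n/2)."""
    upper = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    return 2 * upper

alpha = 0.05
for heads in (14, 15):
    p_val = two_tailed_p(20, heads)
    verdict = "reject H0" if p_val <= alpha else "fail to reject H0"
    print(f"{heads} heads out of 20: p = {p_val:.3f} -> {verdict}")
```

With 14 heads the p-value (≈ 0.115) exceeds α, so the null is retained; with 15 heads the p-value (≈ 0.041) falls below α, so the null is rejected.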

Critics of p-values point out that the criterion used to decide "statistical significance" is based on the somewhat arbitrary choice of level (often set at 0.05). A proposed replacement for the p-value is p-rep.

Frequent misunderstandings
There are several common misunderstandings about p-values.


 * 1) The p-value is not the probability that the null hypothesis is true (a claim sometimes made to justify the "rule" of treating p-values close to 0 (zero) as significant).
 * In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity. This is the Jeffreys-Lindley paradox.
 * 2) The p-value is not the probability that a finding is "merely a fluke" (again, a claim used to justify treating small p-values as "significant").
 * As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot simultaneously be used to gauge the probability of that assumption being true.
 * 3) The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.
 * 4) The p-value is not the probability that a replicating experiment would not yield the same conclusion.
 * 5) 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
 * 6) The significance level of the test is not determined by the p-value.
 * The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed.
 * 7) The p-value does not indicate the size or importance of the observed effect (compare with effect size).
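One point related to misunderstanding (3) can be made concrete by simulation: it is the pre-chosen significance level α, not any individual p-value, that (at most) bounds the long-run rate of false rejections when the null is true. A minimal sketch, assuming a fair coin and the two-tailed binomial test from the example above (helper names are illustrative):

```python
import random
from math import comb

random.seed(0)

def two_tailed_p(n, k, p=0.5):
    """Two-tailed binomial p-value; max(k, n - k) folds the lower tail
    onto the upper tail, valid for the symmetric fair-coin null."""
    j = max(k, n - k)
    upper = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(j, n + 1))
    return min(1.0, 2 * upper)

# Simulate many 20-flip experiments in which the null (fair coin) is TRUE
trials = 10_000
rejections = sum(
    1 for _ in range(trials)
    if two_tailed_p(20, sum(random.random() < 0.5 for _ in range(20))) <= 0.05
)
print(rejections / trials)  # long-run false-rejection rate, at most about 0.05
```

Because the binomial test is discrete, the actual false-rejection rate here is about 0.041 rather than exactly 0.05; the individual p-values vary from experiment to experiment and none of them equals that rate.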

Additional reading

 * Dallal GE (2007) Historical background to the origins of p-values and the choice of 0.05 as the cut-off for significance
 * Hubbard R, Armstrong JS (2005) Historical background on the widespread confusion of the p-value (PDF)
 * Fisher's method for combining independent tests of significance using their p-values