Statistical power

The power of a statistical test is the probability that the test will reject a false null hypothesis (that is, that it will not make a Type II error). As power increases, the chances of a Type II error decrease, and vice versa. The probability of a Type II error is referred to as the false negative rate (β). Therefore power is equal to 1 − β.

Power analysis can be conducted either before (a priori) or after (post hoc) data are collected. A priori power analysis is conducted before the research is carried out and is typically used to determine the sample size needed to achieve adequate power. Post hoc power analysis is conducted after a study has been completed and uses the obtained sample size and effect size to determine what the power of the study was, assuming that the effect size in the sample is equal to the population effect size.
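As an illustration of an a priori calculation, the following minimal sketch estimates the sample size per group for a two-sided comparison of two means. It relies on assumptions not stated in the text: a normal (z-test) approximation with equal group sizes and a standardized effect size d. The function name n_per_group is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means (normal approximation, equal group sizes).

    d is the standardized effect size (difference in means / common SD).
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Because the normal approximation ignores the extra uncertainty from estimating the variance, an exact t-test calculation typically asks for slightly more participants.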

Statistical tests attempt to use data from samples to determine if differences or similarities exist in a population. For example, to test the null hypothesis that the mean scores of men and women on a test do not differ, samples of men and women are drawn, the test is administered to them, and the mean score of one group is compared to that of the other group using a statistical test. The power of the test is the probability that the test will find a statistically significant difference between men and women, as a function of the size of the true difference between those two populations. Despite the use of random samples, which will tend to mirror the population due to mathematical properties such as the central limit theorem, there is always a chance that the samples will appear to support or refute a tested hypothesis when the reality is the opposite. This risk is quantified as the power of the test and as the statistical significance level used for the test.
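The two-group example above can be sketched as a simulation: draw many pairs of samples from populations with a known true difference and count how often the test rejects the null hypothesis. This is only a rough illustration. It assumes normally distributed scores with a known common standard deviation (so a z-test can be used), and the function name and default values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(true_diff, n, sd=1.0, alpha=0.05, n_sims=2000, seed=0):
    """Estimate power by simulation for a two-sided, two-sample z-test.

    true_diff: true difference between the two population means
    n:         number of observations in each group
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * (2 / n) ** 0.5  # standard error of the difference in means
    rejections = 0
    for _ in range(n_sims):
        men = [rng.gauss(0.0, sd) for _ in range(n)]
        women = [rng.gauss(true_diff, sd) for _ in range(n)]
        z = (fmean(women) - fmean(men)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


# With no true difference, the rejection rate approximates the significance level.
print(simulated_power(true_diff=0.0, n=50))
# With a true difference of half a standard deviation, it approximates the power.
print(simulated_power(true_diff=0.5, n=63))
```

The proportion of rejections is only an estimate of the power, and its precision improves as n_sims grows.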

Statistical power depends on:
 * the statistical significance criterion used in the test
 * the size of the difference or the strength of the similarity (that is, the effect size) in the population
 * the sensitivity of the data.

A significance criterion is a statement of how unlikely a result must be, if the null hypothesis is true, to be considered significant. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of obtaining a difference at least as large as the one observed, if the null hypothesis is true, must be less than 0.05, and so on. One way to increase the power of a test is to increase (that is, weaken) the significance level. This increases the chance of obtaining a statistically significant result (rejecting the null hypothesis) when the null hypothesis is false; that is, it reduces the risk of a Type II error. But it also increases the risk of obtaining a statistically significant result when the null hypothesis is in fact true; that is, it increases the risk of a Type I error.
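The trade-off between the significance criterion and power can be made concrete with a small calculation. The sketch below assumes a two-sided, two-sample z-test with known, equal variances. The effect size (d = 0.5) and group size (n = 50) are arbitrary illustrative values, and z_test_power is a hypothetical helper name.

```python
from statistics import NormalDist


def z_test_power(d, n, alpha=0.05):
    """Power of a two-sided, two-sample z-test with n observations per group
    and standardized effect size d (known, equal variances assumed)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    delta = d * (n / 2) ** 0.5  # expected value of the z statistic
    # Probability that the statistic falls in either rejection region.
    return std_normal.cdf(delta - z_crit) + std_normal.cdf(-delta - z_crit)


for alpha in (0.05, 0.01, 0.001):
    print(f"alpha = {alpha}: power = {z_test_power(0.5, 50, alpha):.3f}")
```

Under these assumptions, each stricter criterion lowers the power for the same effect size and sample size.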

Calculating the power requires first specifying the effect size you want to detect. The greater the effect size, the greater the power.

Sensitivity can be increased by using statistical controls, by increasing the reliability of measures (as in psychometric reliability), and by increasing the size of the sample. Increasing sample size is the most commonly used method for increasing statistical power.
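The dependence of power on effect size and sample size can be tabulated in a few lines. As a hedged sketch, the code below again assumes a two-sided, two-sample z-test at a significance level of 0.05. The grid of effect sizes and group sizes is arbitrary, and z_test_power is a hypothetical helper name.

```python
from statistics import NormalDist


def z_test_power(d, n, alpha=0.05):
    """Power of a two-sided, two-sample z-test with n observations per group
    and standardized effect size d (known, equal variances assumed)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    delta = d * (n / 2) ** 0.5  # expected value of the z statistic
    return std_normal.cdf(delta - z_crit) + std_normal.cdf(-delta - z_crit)


group_sizes = (20, 50, 100, 200)
print("d    " + "  ".join(f"n={n:<4}" for n in group_sizes))
for d in (0.2, 0.5, 0.8):
    row = "  ".join(f"{z_test_power(d, n):.2f}  " for n in group_sizes)
    print(f"{d:<4} {row}")
```

Reading across a row shows the gain from a larger sample, and reading down a column shows the gain from a larger effect.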

Although there are no formal standards for power, most researchers who assess the power of their tests use 0.80 as a standard for adequacy.

A common misconception among those new to statistical power is that power is a property of a study or experiment. In reality, any statistical result that has a p-value has an associated power. For example, in the context of a single multiple regression, there will be a different level of statistical power associated with the overall R-squared and with each of the regression coefficients. When determining an appropriate sample size for a planned study, it is important to consider that power will vary across the different hypotheses.

There are times when the recommendations of power analysis regarding sample size will be inadequate. Power analysis is appropriate when the concern is with the correct acceptance or rejection of a null hypothesis. In many contexts, the issue is less about determining whether there is a difference than about getting a more refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around .50, a sample size of about 30 will give us approximately 80% power (alpha = .05, two-tail). However, in doing this study we are probably more interested in knowing whether the correlation is .30, .50, or .60. In this context we would need a much larger sample size in order to narrow the confidence interval of our estimate to a range that is acceptable for our purposes. These and other considerations often result in the true but somewhat simplistic recommendation that when it comes to sample size, "More is better!"
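To see why estimation needs more data than detection, one can compute how the confidence interval around a sample correlation of .50 narrows as the sample grows. The sketch below uses the Fisher z-transformation, an approximation that the text does not name and that is assumed here. The function name correlation_ci and the sample sizes are illustrative.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, confidence=0.95):
    """Approximate confidence interval for a correlation coefficient,
    based on the Fisher z-transformation (requires n > 3)."""
    z = math.atanh(r)            # Fisher transform of r
    se = 1 / math.sqrt(n - 3)    # approximate standard error on the z scale
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


for n in (20, 30, 100, 200, 400):
    low, high = correlation_ci(0.5, n)
    print(f"n = {n:>3}: 95% CI from {low:.2f} to {high:.2f}")
```

Under this approximation, small samples yield intervals too wide to separate a correlation of .30 from one of .60, while samples in the hundreds narrow the interval considerably.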

However, huge sample sizes can make statistical tests so powerful that the null hypothesis is almost always rejected for real data, because even trivially small differences become statistically significant. This is a problem in studies of differential item functioning.
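One way to illustrate this is to compute the smallest standardized effect that a test can detect with a given power as the sample grows. The sketch below assumes a two-sided, two-sample z-test with equal group sizes. The function name min_detectable_effect is hypothetical.

```python
from statistics import NormalDist


def min_detectable_effect(n, alpha=0.05, power=0.80):
    """Smallest standardized effect size detectable with the given power
    by a two-sided, two-sample z-test with n observations per group."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    return (z_alpha + z_beta) * (2 / n) ** 0.5


for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,} per group: d = {min_detectable_effect(n):.4f}")
```

With a million observations per group, the detectable effect under these assumptions is a small fraction of one percent of a standard deviation, a difference that is unlikely to matter in practice.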

Funding agencies, ethics boards and research review panels frequently request that a researcher perform a power analysis. The argument is that if a study is inadequately powered, there is no point in completing the research.