Stein's example

Stein's example, sometimes referred to as Stein's phenomenon or Stein's paradox, is a surprising effect observed in decision theory and estimation theory. Simply stated, it shows that when three or more parameters are estimated simultaneously, there exist combined estimators that are more accurate on average (that is, with lower expected total squared error) than any method that estimates the parameters separately. This is surprising because the parameters and the measurements may be entirely unrelated. The phenomenon is named after its discoverer, Charles Stein.

Formal statement
Let $${\boldsymbol \theta}$$ be a vector consisting of $$n \ge 3$$ unknown parameters. To estimate these parameters, a single measurement $$X_i$$ is performed for each parameter $$\theta_i$$, resulting in a vector $${\mathbf X}$$ of length $$n$$. Suppose the measurements are independent, identically distributed, Gaussian random variables, with mean $${\boldsymbol \theta}$$ and variance 1, i.e.,
 * $${\mathbf X} \sim N({\boldsymbol \theta}, I).$$

Thus, each parameter is estimated using a single noisy measurement, and each measurement is equally inaccurate.

Under such conditions, it is most intuitive (and most common) to use each measurement as an estimate of its corresponding parameter. This so-called "ordinary" decision rule can be written as
 * $${\hat {\boldsymbol \theta}} = {\mathbf X}.$$

The quality of such an estimator is measured by its risk function. A commonly used risk function is the mean squared error, defined as
 * $$E \left\{ \| {\boldsymbol \theta} - {\hat {\boldsymbol \theta}} \|^2 \right\}.$$

Surprisingly, it turns out that the "ordinary" estimator proposed above is suboptimal in terms of mean squared error. In other words, in the setting discussed here, there exist alternative estimators which always achieve lower mean squared error, no matter what the value of $${\boldsymbol \theta}$$ is.

More accurately, an estimator $${\hat {\boldsymbol \theta}}_1$$ is said to dominate another estimator $${\hat {\boldsymbol \theta}}_2$$ if, for all values of $${\boldsymbol \theta}$$, the risk of $${\hat {\boldsymbol \theta}}_1$$ is lower than, or equal to, the risk of $${\hat {\boldsymbol \theta}}_2$$. An estimator is said to be admissible if no other estimator dominates it. Thus, Stein's example can be simply stated as follows: The ordinary decision rule for estimating the mean of a multivariate Gaussian distribution is inadmissible under mean squared error risk.

Many simple, practical estimators achieve better performance than the ordinary estimator. The best-known example is the James-Stein estimator.
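The dominance claim can be checked numerically. The following sketch (a minimal Monte Carlo simulation, with an arbitrarily chosen true parameter vector) compares the total mean squared error of the ordinary estimator against the basic James-Stein estimator, which shrinks the measurement vector toward the origin by the factor $$1 - (n-2)/\|{\mathbf X}\|^2$$:

```python
import random

def james_stein(x):
    # Shrink the ordinary estimate toward the origin:
    # theta_hat = (1 - (n - 2) / ||x||^2) * x, valid for n >= 3.
    n = len(x)
    s = sum(v * v for v in x)
    factor = 1.0 - (n - 2) / s
    return [factor * v for v in x]

def total_squared_error(theta, est):
    return sum((t - e) ** 2 for t, e in zip(theta, est))

random.seed(0)
theta = [1.0, 1.0, 1.0, 1.0, 1.0]   # arbitrary true means (n = 5)
trials = 20000

risk_mle, risk_js = 0.0, 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]   # X ~ N(theta, I)
    risk_mle += total_squared_error(theta, x)
    risk_js += total_squared_error(theta, james_stein(x))
risk_mle /= trials
risk_js /= trials

print(risk_mle, risk_js)  # risk_mle is close to n = 5; risk_js is strictly smaller
```

Repeating the experiment with any other choice of `theta` (including vectors of unrelated magnitudes) gives the same qualitative result: the estimated James-Stein risk stays below the ordinary estimator's risk, in line with the dominance statement above.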

For a sketch of the proof of this result, see Proof of Stein's example.

Implications
Stein's example is surprising, since the "ordinary" decision rule is intuitive and commonly used. In fact, numerous methods for estimator construction, including maximum likelihood estimation, best linear unbiased estimation, least squares estimation and optimal equivariant estimation, all result in the "ordinary" estimator. Yet, as discussed above, this estimator is suboptimal.

To demonstrate the unintuitive nature of Stein's example, consider the following real-world example. Suppose we are to estimate three unrelated parameters, such as the US wheat yield for 1993, the number of spectators at the Wimbledon tennis tournament in 2001, and the weight of a randomly chosen candy bar from the supermarket. Suppose we have independent Gaussian measurements of each of these quantities. Stein's example now tells us that we will get a better estimate for the three parameters by simultaneously using the three unrelated measurements.

At first sight (or to the naïve reader) it appears that we somehow get a better estimate of US wheat yield by measuring unrelated statistics such as the number of spectators at Wimbledon and the weight of a candy bar. This is of course absurd; we have not obtained a better estimate of US wheat yield by itself, but rather an estimate of the means of all three random variables that has lower total risk. The cost of a bad estimate in one component can thus be compensated by a better estimate in another component.

Resolution of the "paradox"
One may ask how the simultaneous measurement of several parameters reduces the total error of the parameters. This stems from the fact that some properties of a distribution can be estimated more accurately when multiple observations are present, even if those observations are statistically independent. For example, consider the squared norm of the parameter vector, $$\|{\boldsymbol \theta}\|^2$$. One might consider estimating this value using $$\|{\mathbf X}\|^2$$. However, the expectation of this estimate can be shown to be
 * $$E\{ \|{\mathbf X}\|^2 \} = \|{\boldsymbol \theta}\|^2 + n,$$

so that $$\|{\mathbf X}\|^2$$ tends to be an overestimate of $$\|{\boldsymbol \theta}\|^2$$. Furthermore, $$\|{\boldsymbol \theta}\|^2$$ can be estimated more accurately when more parameters are present.
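The identity above is easy to verify by simulation. The following sketch (with an arbitrarily chosen parameter vector) estimates $$E\{\|{\mathbf X}\|^2\}$$ empirically and compares it with $$\|{\boldsymbol \theta}\|^2 + n$$:

```python
import random

random.seed(1)
theta = [2.0, -1.0, 0.5]                   # arbitrary parameter vector (n = 3)
n = len(theta)
norm_sq_theta = sum(t * t for t in theta)  # ||theta||^2 = 5.25

trials = 50000
avg = 0.0
for _ in range(trials):
    x = [random.gauss(t, 1.0) for t in theta]  # X ~ N(theta, I)
    avg += sum(v * v for v in x)               # accumulate ||X||^2
avg /= trials

print(avg, norm_sq_theta + n)  # the empirical mean is close to ||theta||^2 + n
```

Each coordinate contributes its squared mean plus the variance of its noise, which is why the bias is exactly $$n$$: it grows with the number of parameters, not with their values.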

It follows from the above equation that the "ordinary" estimate tends to overestimate the norm of the parameters. This can be corrected by shrinking the ordinary estimator, using, for example, the James-Stein estimator.