Empirical Bayes method

In statistics, empirical Bayes methods are a class of methods which use empirical data to evaluate / approximate the conditional probability distributions that arise from Bayes' theorem. These methods allow one to estimate quantities (probabilities, averages, etc.) about an individual member of a population by combining information from empirical measurements on the individual and on the entire population.

Empirical Bayes methods involve:


 * An "underlying" probability distribution of some unobservable quantity is assigned to each member of a statistical population. This quantity is a random variable if a member of the population is chosen at random.  The probability distribution of this random variable is not known, and is thought of as a property of the population.


 * An observable quantity assigned to each member of the population. When a random sample is taken from the population, it is desired first to estimate the "underlying" probability distribution, and then to estimate the value of the unobservable quantity assigned to each member of the sample.

Introduction
In the Bayesian approach to statistics, we consider the problem of estimating some probability (such as a future outcome or a noisy measurement), based on measurements of our data, a model for these measurements, and some model for our prior beliefs about the system. Let us consider a standard two-stage model, where we write our data measurements as a vector $$y = \{y_1, y_2, \dots, y_n\} $$, and our prior beliefs as some vector of random unknowns $$\theta$$. We assume we can model our measurements with a conditional probability distribution (or likelihood ) $$\rho(y|\theta)$$, and also the prior as $$\rho(\theta|\eta)$$, where $$\eta$$ is some hyper-parameter. For example, we might choose $$\rho(y|\theta)$$ to be a binomial distribution, and $$\rho(\theta|\eta)$$ as a Beta distribution (the conjugate prior). Empirical Bayes then employs the complete set of empirical data to make inferences about the prior $$\theta$$, and then plugs this into the likelihood $$\rho(y|\theta)$$ to make estimates for future outcomes of individual measurements.

To see this in action, use Bayes' theorem to write an expression for the posterior probability for $$\theta$$ as


 * $$\rho(\theta|y)={ {\rho(y|\theta)\rho(\theta|\eta)} \over {\int {\rho(y|\theta)\rho(\theta|\eta)\, d\theta}}}$$

and let us define the marginal (or normalizing constant) as


 * $$m(y|\eta)={\int {\rho(y|\theta)\rho(\theta|\eta)\, d\theta}}$$

Empirical Bayes (EB) combines Bayesianism and frequentist approaches to estimation. EB approximates the marginal (and/or full posterior) distribution (with point estimation, maximum likelihood estimation (MLE), squared error loss (SEL), Monte Carlo/numerical integration, etc.) and then estimates the approximate marginal using the empirical data (frequency probability). EB takes several forms, including non-parameteric and parameteric forms. We describe a few common examples below.

Robbins method (1956): non-parameteric empirical Bayes (NPEB)
We consider a case of compound sampling, where probability for each $$\rho(y_i|\theta_i)$$ is specified by a Poisson distribution,


 * $$\rho(y_i|\theta_i)={{\theta_i}^{y_i} e^{-\theta_i} \over {y_i}!}$$

while the prior is unspecified except that it is also i.i.d (call it $$G(\Theta)$$). Compound sampling arises in a variety of statistical estimation problems, such as accident rates, clinical trials, etc. We simply seek a point estimate of $$\theta_i$$. Because the prior is unspecified, we seek a non-parametric estimate of the posterior. We may write a point Bayes estimate for the prior (assuming squared error loss (SEL)) as (see Carlin and Louis, Sec. 3.2 and Appendix B):
 * $$E(\theta_i\mid Y=y_i)={(y_i+1)m_G(Y=y_i+1) \over m_G(Y=y_i)}.\qquad\qquad\qquad $$

Proof:

First, we show that the point estimate for the prior is the estimated mean of the posterior. Write the posterior risk for a point estimate for $$\theta = a$$ under squared error loss (SEL) (or loss function) as


 * $$\rho(G,a) = \int (\theta-a)^2g(\theta|y)\,d\theta.$$

To find the minimum error, set the derivative with respect to a equal to zero


 * $$\frac{ \partial }{\partial a}[\rho(G,a)] = \int -2(\theta-a)g(\theta|y)\,d\theta = 0.$$

Solving for a yields


 * $$a = \int \theta g(\theta|y)\, d\theta = E(\theta|y).$$

A quick check shows that the second derivative is greater than zero, indicating a true minimum.


 * $$\frac{ \partial^2 }{\partial a^2}[\rho(G,a)] = 2 \int g(\theta|y) d\theta = 2.$$

Note that had we chosen a absolute error loss, the point estimate would instead be the median, and with 0-1 error loss, it would be the mode (definitions pending).

Second, we wish to show that the point estimate is a simple ratio of the estimated marginals for the specific case of the Poisson distribution. Write the point estimate for the prior as:


 * $$E(\theta_i|y_i) = {\int (\theta^{y+1} e^{-\theta} / {y_i}!)\,dG(\theta) \over {\int (\theta^y e^{-\theta} / {y_i}!)\,dG(\theta}) }.$$

Multiply the expression by $$({y_i}+1)/({y_i}+1)$$ and use Bayes definition of the marginal to obtain


 * $$ E(\theta_i|y_i)= {{(y_i + 1) m_G(y_i + 1) }\over {m_G(y_i)}}.$$

To take advantage of this, Robbins (1955) suggested estimating the marginals with their empirical frequencies Yd, yielding the fully non-parametric estimate as:


 * $$ E(\theta_i|y_i)= (y_i + 1) { {\#(Y_d = y_i + 1)} \over {\#( Y_d = y_i)} }$$

(see also Good-Turing frequency estimation).

Example: Accident rates

Suppose each customer of an insurance company has an "accident rate" &Theta; and is insured against "accidents"; the probability distribution of $$\Theta$$ is the "underlying" distribution, and is unknown. The number of "accidents" suffered by each customer in a specified baseline time period has a Poisson distribution whose expected value is the particular customer's "accident rate". That number of "accidents" is the observable quantity. A crude way to estimate the underlying probability distribution of the "accident rate" &Theta; is to estimate the proportion of members of the whole population suffering 0, 1, 2, 3, ... accidents during the specified time period to be equal to the corresponding proportion in the observed random sample. Having done so, it is then desired the "accident rate" of each customer in the sample. One may use the conditional expected value of the "accident rate" &Theta; given the observed number X of "accidents" during the baseline period.

Thus, if a customer suffers six "accidents" during the baseline period, that customer's estimated "accident rate" is 7 &times; [the proportion of the sample who suffered 7 "accidents"] / [the proportion of the sample who suffered 6 "accidents"].

Parametric empirical Bayes
If the likelihood and its prior take on simple parametric forms (such as 1- or 2-dimensional likelihood functions with simple conjugate priors), then the empirical Bayes problem is only to estimate the marginal $$m(y|\eta)$$ and the hyperparameters $$\eta$$ using the complete set of empirical measurements. For example, one common approach, called (parametric Empirical Bayes point estimation) is to approximate the marginal using the maximum likelihood estimate (MLE), or a Moments expansion, which allows one to express the hyperparameters $$\eta$$ in terms of the empirical mean and variance. This simplified marginal allows one to plug in the empirical averages into a point estimate for the prior $$\theta$$. The resulting equation for the prior $$\theta$$ is greatly simplified, as shown below.

There are several common Parameteric Empirical Bayes models, including the Poisson-Gamma model (below), the Beta-binomial model, the Gaussian-Gaussian model, the multinomial-Dirichlet model, as well specific models for Bayesian linear regression (see below) and Bayesian multivariate linear regression. More advanced approaches include hierarchial Bayesian models and Bayesian mixture models.

Poisson-Gamma model
For example, in the example above, let the likelihood be a Poisson distribution, and let the prior now be specified by the conjugate prior, which is a Gamma distribution ($$G(\alpha,\beta)$$) (where $$\eta = (\alpha,\beta)$$):


 * $$ \rho(\theta|\alpha,\beta) =  \frac{\theta^{\alpha-1}\, e^{-\theta / \beta} }{\beta^{\alpha} \Gamma(\alpha)}  \ \mathrm{for}\ \theta > 0, \alpha > 0, \beta > 0 \,\!$$

It is straightforward to show the posterior is also a Gamma distribution. Write


 * $$ \rho(\theta|y) \propto \rho(y|\theta) \rho(\theta|\alpha, \beta) $$

where we have omitted the marginal since it does not depend explicitly on $$\theta$$. Expanding terms which do depend on $$\theta$$ gives the posterior as:


 * $$ \rho(\theta|y) \propto (\theta^{y}\, e^{-\theta}) (\theta^{\alpha-1}\, e^{-\theta / \beta}) = \theta^{y+ \alpha -1}\, e^{- \theta (1+1 / \beta)} $$

So we see that the posterior density is also a Gamma distribution $$G(\alpha',\beta')$$, where $$\alpha' = y + \alpha$$, and $$\beta' = (1+1 / \beta)^{-1}$$. Also notice that the marginal is simply the integral of the posterior over all $$\Theta$$, which turns out to be a negative binomial distribution.

To apply Empirical Bayes, we will approximate the marginal using the maximum likelihood estimate (MLE). But since the posterior is a Gamma distribution, the MLE of the marginal turns out to be just the mean the of posterior, which is the point estimate $$E(\theta|y)$$ we need. Recalling that the mean $$\mu$$ of a Gamma distribition $$G(\alpha', \beta')$$ is simply $$\alpha' \beta'$$, we have


 * $$ E(\theta|y) = \alpha' \beta' = \frac{\bar{y}+\alpha}{1+1 / \beta} = \frac{\beta}{1+\beta}\bar{y} + \frac{1}{1+\beta} (\alpha \beta) $$

To obtain the values of $$\alpha$$ and $$\beta$$, Empirical Bayes prescribes estimating mean $$\alpha\beta$$ and variance $$\alpha\beta^2$$ using the complete set of empirical data.

The resulting point estimate $$ E(\theta|y) $$ is therefore like a weighted average of the sample mean $$\bar{y}$$ and the prior mean $$\mu = \alpha\beta$$. This turns out to be a general feature of Empirical Bayes; the point estimates for the prior (i.e mean) will look like a weighted averages of the sample estimate and the prior estimate. (Likewise for estimates of the variance).


 * to be continued