Model selection

Model selection is the task of selecting a statistical model from a set of candidate models, given data. In its most basic form, this is one of the fundamental tasks of scientific inquiry. Determining the principle that explains a series of observations is often linked directly to a mathematical model predicting those observations. For example, when Galileo performed his inclined plane experiments, he demonstrated that the motion of the balls fit the parabola predicted by his model.

Of the endless number of possible models that could have produced the data, how can one even begin to choose the correct model? The mathematical approach commonly taken decides among a given set of candidate models, so it is still necessary to choose this set before beginning. Often simple models such as polynomials or quadrics are used as a starting point. Burnham and Anderson (2002) emphasize throughout their book on model selection the importance of choosing models based on sound scientific principles for modeling the underlying data.

Once the set of possible models is selected, mathematical analysis allows us to determine the best of these models. What is meant by best is controversial. A good model selection technique will balance goodness of fit against complexity. More complex models are better able to adapt their shape to fit the data (for example, a fifth-order polynomial, which has six coefficients, can exactly fit six points), but the additional parameters may not represent anything useful. (Perhaps those six points are really just randomly distributed about a line.) Goodness of fit is generally determined in the chi-square sense. Complexity is generally measured by counting the number of free parameters in the model.
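The trade-off between fit and complexity can be illustrated numerically. The following is a minimal sketch, assuming NumPy is available; the six data values are made up for illustration and lie roughly along a line, and the helper name residual_sum_of_squares is hypothetical. It compares a straight line (two free parameters) with a degree-5 polynomial (six free parameters) on the same six points.

```python
import numpy as np

# Six illustrative points scattered about the line y = x (made-up values).
x = np.arange(6, dtype=float)
y = np.array([0.1, 0.9, 2.2, 2.8, 4.1, 4.9])

def residual_sum_of_squares(degree):
    """Least-squares polynomial fit of the given degree; returns the RSS."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

# A polynomial of degree d has d + 1 free parameters (its coefficients).
rss_by_degree = {d: residual_sum_of_squares(d) for d in (1, 5)}
for d, rss in rss_by_degree.items():
    print(f"degree {d}: {d + 1} free parameters, RSS = {rss:.3g}")
```

The degree-5 fit drives the residual to zero up to rounding error, yet it does so only by spending one parameter per data point, so the perfect fit says nothing about whether the extra parameters are meaningful.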

Model selection techniques can be considered as estimators of some physical quantity, such as the probability of the model producing the given data. The bias and variance of such an estimator are both important measures of its quality. Asymptotic efficiency is also often considered.

A standard example of model selection is curve fitting, where, given a set of points and other background knowledge (e.g. the points are the result of i.i.d. samples), we must select the function whose curve best describes the points.
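As a rough sketch of curve fitting treated as model selection, the example below scores polynomial fits of several degrees with the Akaike information criterion. It rests on stated assumptions: i.i.d. Gaussian errors, so that the criterion reduces, up to an additive constant, to n ln(RSS/n) + 2k; synthetic data generated from a line; and a hypothetical helper named aic_least_squares. It illustrates the idea only and is not a recommended procedure.

```python
import numpy as np

def aic_least_squares(x, y, degree):
    """AIC (up to an additive constant) for a least-squares polynomial fit,
    assuming i.i.d. Gaussian errors."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
    k = degree + 2  # polynomial coefficients plus the noise variance
    return n * np.log(rss / n) + 2 * k

# Synthetic data: a straight line plus Gaussian noise (assumed for the demo).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

# The candidate set must be chosen before the analysis begins.
candidate_degrees = [0, 1, 2, 3, 4]
scores = {d: aic_least_squares(x, y, d) for d in candidate_degrees}
best_degree = min(scores, key=scores.get)

for d in candidate_degrees:
    print(f"degree {d}: AIC = {scores[d]:.2f}")
print("selected degree:", best_degree)
```

The 2k term penalizes each additional free parameter, so a higher-degree polynomial is preferred only when it reduces the residual sum of squares by enough to offset the penalty.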

Model selection methods

 * Akaike information criterion
 * Bayesian information criterion
 * Bayesian model comparison
 * Mallows' Cp
 * Deviance information criterion
 * Geometric information criterion
 * Minimum description length
 * Minimum message length
 * Stepwise regression