AdaBoost

AdaBoost, short for Adaptive Boosting, is a machine learning algorithm formulated by Yoav Freund and Robert Schapire. It is a meta-algorithm and can be used in conjunction with many other learning algorithms to improve their performance. AdaBoost is adaptive in the sense that each subsequent classifier is tweaked in favor of the instances misclassified by the previous classifiers. AdaBoost is sensitive to noisy data and outliers. In other settings, however, it is often less susceptible to overfitting than most learning algorithms.

AdaBoost calls a weak classifier repeatedly in a series of rounds $$t = 1,\ldots,T$$. On each call, a distribution of weights $$D_{t}$$ is updated to indicate the importance of each example in the data set for the classification. On each round, the weight of each incorrectly classified example is increased (or, alternatively, the weight of each correctly classified example is decreased), so that the new classifier focuses more on those examples.

The algorithm for the binary classification task
Given: $$(x_{1},y_{1}),\ldots,(x_{m},y_{m})$$ where $$x_{i} \in X,\, y_{i} \in Y = \{-1, +1\}$$

Initialise $$D_{1}(i) = \frac{1}{m}, i=1,\ldots,m.$$

For $$t = 1,\ldots,T$$:

 * Find the classifier $$h_{t} : X \to \{-1,+1\}$$ that minimizes the error with respect to the distribution $$D_{t}$$: $$h_{t} = \arg \min_{h_{j} \in \mathcal{H}} \epsilon_{j}$$, where $$\epsilon_{j} = \sum_{i=1}^{m} D_{t}(i)[y_i \ne h_{j}(x_{i})]$$
 * Prerequisite: $$\epsilon_{t} < 0.5$$, otherwise stop.
 * Choose $$\alpha_{t} \in \mathbf{R}$$, typically $$\alpha_{t}=\frac{1}{2}\ln\frac{1-\epsilon_{t}}{\epsilon_{t}}$$ where $$\epsilon_{t}$$ is the weighted error rate of classifier $$h_{t}$$.
 * Update: $$D_{t+1}(i) = \frac{ D_{t}(i) \, e^{- \alpha_{t} y_{i} h_{t}(x_{i})} }{ Z_{t} }$$ where $$Z_{t}$$ is a normalization factor (chosen so that $$D_{t+1}$$ will be a probability distribution, i.e. it sums to one over all $$i$$).

Output the final classifier:

$$H(x) = \textrm{sign}\left( \sum_{t=1}^{T} \alpha_{t}h_{t}(x)\right)$$
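
The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: it assumes the weak classifiers are one-feature threshold rules (decision stumps) found by exhaustive search, and it breaks a zero score in favor of +1. The function names `stump_predict`, `best_stump`, `adaboost_train` and `adaboost_predict` are hypothetical and chosen only for this example.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    """Weak classifier (assumed here): a one-feature threshold rule."""
    return polarity if x[feature] > threshold else -polarity

def best_stump(X, y, D):
    """Return (error, feature, threshold, polarity) minimizing the D-weighted error."""
    best = None
    for feature in range(len(X[0])):
        values = sorted({x[feature] for x in X})
        # Candidate thresholds: one below the minimum, then midpoints between neighbours.
        thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (+1, -1):
                err = sum(d for x, yi, d in zip(X, y, D)
                          if stump_predict(x, feature, threshold, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, feature, threshold, polarity)
    return best

def adaboost_train(X, y, T):
    m = len(X)
    D = [1.0 / m] * m                        # D_1(i) = 1/m
    ensemble = []                            # list of (alpha_t, stump parameters)
    for _ in range(T):
        eps, feature, threshold, polarity = best_stump(X, y, D)
        if eps >= 0.5:                       # prerequisite eps_t < 0.5, otherwise stop
            break
        eps = max(eps, 1e-10)                # guard against division by zero (assumption)
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, feature, threshold, polarity))
        # Update: D_{t+1}(i) = D_t(i) * exp(-alpha_t * y_i * h_t(x_i)) / Z_t
        D = [d * math.exp(-alpha * yi * stump_predict(x, feature, threshold, polarity))
             for x, yi, d in zip(X, y, D)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def adaboost_predict(ensemble, x):
    """H(x) = sign(sum_t alpha_t * h_t(x)); a zero score is mapped to +1 here."""
    score = sum(alpha * stump_predict(x, f, th, p) for alpha, f, th, p in ensemble)
    return 1 if score >= 0 else -1

# Toy one-dimensional data set that no single stump classifies perfectly.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [+1, +1, -1, -1, +1, +1]
ensemble = adaboost_train(X, y, T=3)
predictions = [adaboost_predict(ensemble, x) for x in X]
print(predictions)  # [1, 1, -1, -1, 1, 1]
```

On this toy data the best single stump misclassifies two of the six points, while the weighted vote of three stumps classifies all six correctly.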

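The equation to update the distribution $$D_{t}$$ is constructed so that, for $$\alpha_{t} > 0$$ (which is the case for the typical choice of $$\alpha_{t}$$ when $$\epsilon_{t} < 0.5$$):

$$e^{- \alpha_{t} y_{i} h_{t}(x_{i})} \begin{cases} <1, & y_{i}=h_{t}(x_{i}) \\ >1, & y_{i} \ne h_{t}(x_{i}) \end{cases}$$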

Thus, after selecting an optimal classifier $$h_{t}$$ for the distribution $$D_{t}$$, the examples $$x_{i}$$ that the classifier $$h_{t}$$ classified correctly are weighted less and those that it classified incorrectly are weighted more. Therefore, when the algorithm tests the classifiers on the distribution $$D_{t+1}$$, it will select a classifier that better classifies those examples that the previous classifier missed.
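
A small numeric illustration of a single reweighting step may help. The four examples and the predictions `h` below are made up for this sketch; only the formulas come from the algorithm above.

```python
import math

D = [0.25, 0.25, 0.25, 0.25]   # current distribution D_t over four examples
y = [+1, +1, -1, -1]           # true labels
h = [+1, +1, -1, +1]           # hypothetical predictions h_t(x_i); the last one is wrong

# Weighted error and the typical choice of alpha_t.
eps = sum(d for d, yi, hi in zip(D, y, h) if yi != hi)
alpha = 0.5 * math.log((1 - eps) / eps)

# Multiply each weight by exp(-alpha * y_i * h_t(x_i)), then normalize by Z_t.
unnormalized = [d * math.exp(-alpha * yi * hi) for d, yi, hi in zip(D, y, h)]
Z = sum(unnormalized)
D_next = [u / Z for u in unnormalized]

print(eps)                                 # 0.25
print([round(d, 4) for d in D_next])       # [0.1667, 0.1667, 0.1667, 0.5]
```

With this choice of $$\alpha_{t}$$, the misclassified examples together carry half of the total weight after the update, so the classifier just selected has weighted error exactly 0.5 on $$D_{t+1}$$.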

Statistical Understanding of Boosting
Boosting can be seen as minimization of a convex loss function over a convex set of functions. Specifically, the loss being minimized is the exponential loss


 * $$\sum_i e^{-y_i f(x_i)}$$

and we are seeking a function


 * $$f = \sum_t \alpha_t h_t$$
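
As a sketch of how this view connects to the algorithm above, assume $$f$$ is built greedily, one term per round, and that $$0 < \epsilon_{t} < 1$$. Write $$f_{t-1} = \sum_{s<t} \alpha_s h_s$$. Unrolling the update rule from $$D_{1}(i) = \frac{1}{m}$$ shows that $$D_{t}(i)$$ is proportional to $$e^{-y_i f_{t-1}(x_i)}$$, so for a fixed $$h_{t}$$ the exponential loss of $$f_{t-1} + \alpha h_{t}$$ equals, up to a factor that does not depend on $$\alpha$$,

$$\sum_i D_{t}(i)\, e^{-\alpha y_i h_{t}(x_i)} = (1-\epsilon_{t})\, e^{-\alpha} + \epsilon_{t}\, e^{\alpha}$$

Setting the derivative with respect to $$\alpha$$ to zero gives

$$-(1-\epsilon_{t})\, e^{-\alpha} + \epsilon_{t}\, e^{\alpha} = 0 \quad\Longrightarrow\quad \alpha_{t} = \frac{1}{2}\ln\frac{1-\epsilon_{t}}{\epsilon_{t}}$$

which is the typical choice of $$\alpha_{t}$$ in the algorithm.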