Least squares

In regression analysis, least squares, also known as ordinary least squares analysis, is a method for linear regression that determines the values of unknown quantities in a statistical model by minimizing the sum of squared residuals (the differences between the observed and predicted values). The method was first described by Carl Friedrich Gauss around 1794, close to the turn of the 19th century (Linear Algebra With Applications, 3rd Edition, by Otto Bretscher). Today, it is available in most statistical software packages. The least-squares approach to regression analysis is optimal in the sense of the Gauss-Markov theorem: under that theorem's assumptions, the least-squares estimator is the best linear unbiased estimator.

A related method is the least mean squares (LMS) method. It arises when the squared residual of a single measurement is minimized at each step by gradient descent. LMS is known to minimize the expectation of the squared residual with the smallest number of operations per iteration; however, it requires a large number of iterations to converge.
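As an illustrative sketch of the LMS update (the linear model, the step size mu, and the data below are assumptions made for demonstration, not part of the method's original statement), each sample triggers one gradient step on its own squared residual:

```python
import numpy as np

def lms_fit(X, y, mu=0.05, n_epochs=200):
    """Least mean squares: one gradient-descent step per sample
    on that sample's squared residual (illustrative sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            e = y_i - w @ x_i      # residual for this single measurement
            w += mu * e * x_i      # step along the negative gradient of e**2
    return w

# Invented example: y = 2*x + 1, with an intercept column in X
X = np.column_stack([np.ones(5), np.arange(5.0)])
y = 2.0 * np.arange(5.0) + 1.0
print(lms_fit(X, y))
```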

Furthermore, many other types of optimization problems can be expressed in a least-squares form, by either minimizing energy or maximizing entropy. Also, fundamental problems such as spectrum computation can be solved using least-squares spectral analysis, an alternative to Fourier analysis that is better suited to long, incomplete records such as most natural data sets.

Context
The method of least squares grew out of the fields of astronomy and geodesy as scientists and mathematicians sought to provide solutions to the challenges of navigating the Earth's oceans during the Age of Exploration. The accurate description of the behavior of celestial bodies was key to enabling ships to sail in open seas where before sailors had to rely on land sightings to determine the positions of their ships.

The method was the culmination of several realizations that took place during the course of the 18th century:


 * The combination of different observations taken under the same conditions as opposed to simply trying one's best to observe and record a single observation accurately. This approach was notably used by Tobias Mayer while studying the librations of the moon.
 * The combination of different observations as being the best estimate of the true value; errors decrease with aggregation rather than increase, perhaps first expressed by Roger Cotes.
 * The combination of different observations taken under different conditions as notably performed by Roger Joseph Boscovich in his work on the shape of the earth and Pierre-Simon Laplace in his work in explaining the differences in motion of Jupiter and Saturn.
 * The development of a criterion that can be evaluated to determine when the solution with the minimum error has been achieved, developed by Laplace in his Method of Situation.

The method itself


In 1795, Carl Friedrich Gauss, at the age of 18, is credited with developing the fundamentals of least-squares analysis. However, as with many of his discoveries, he did not publish the method. Its strength was demonstrated in 1801, when it was used to predict the future location of the newly discovered asteroid Ceres.

On January 1, 1801, the Italian astronomer Giuseppe Piazzi discovered the asteroid Ceres and was able to track its path for 40 days before it was lost in the glare of the sun. Based on these data, astronomers wished to determine the location of Ceres after it emerged from behind the sun without solving Kepler's complicated nonlinear equations of planetary motion. The only predictions that successfully allowed the Hungarian astronomer Franz Xaver von Zach to relocate Ceres were those performed by the 24-year-old Gauss using least-squares analysis.

However, Gauss did not publish the method until 1809, when it appeared in volume two of his work on celestial mechanics, Theoria Motus Corporum Coelestium in sectionibus conicis solem ambientium.

The idea of least-squares analysis was independently formulated by the Frenchman Adrien-Marie Legendre in 1805 and the American Robert Adrain in 1808.

In 1829, Gauss was able to state that the least-squares approach to regression analysis is optimal in the following sense: in a linear model where the errors have a mean of zero, are uncorrelated, and have equal variances, the best linear unbiased estimator of the coefficients is the least-squares estimator. This result is known as the Gauss-Markov theorem.

Problem statement
The objective consists of adjusting the parameters of a model function to best fit a data set. The data set consists of n points $$(y_i,\vec{x}_i)$$ with $$i = 1, 2,\dots, n$$. The model function has the form $$y=f(\vec{x},\vec{a})$$, where $$y$$ is the dependent variable, $$\vec{x}$$ are the independent variables, and $$\vec{a}$$ are the adjustable parameters of the model. We wish to find the parameter values such that the model best fits the data according to a defined error criterion. The least-squares method minimizes the sum of squared errors $$ S = \sum_{i=1}^n (y_i - f(\vec{x}_i,\vec{a}))^2 $$ with respect to the adjustable parameters $$\vec{a}$$.

As an example, suppose the data are height measurements over a surface. We choose to model the data by a plane with parameters for the plane's mean height, tip angle, and tilt angle. The model equation is then $$ y = f(x_1, x_2) = a_1 + a_2 x_1 + a_3 x_2 $$, the independent variables are $$\vec{x}=(x_1,x_2)$$, and the adjustable parameters are $$\vec{a}=(a_1,a_2,a_3)$$.
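This plane-fitting example can be carried out in a few lines; the following sketch uses invented height measurements and NumPy's least-squares solver, which minimizes S directly:

```python
import numpy as np

# Invented sample data: heights y measured at positions (x1, x2)
x1 = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
x2 = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y  = np.array([1.1, 1.9, 3.2, 2.0, 3.1, 3.9])

# Columns correspond to the parameters (a1, a2, a3) of y = a1 + a2*x1 + a3*x2
A = np.column_stack([np.ones_like(x1), x1, x2])

# Minimize S = sum_i (y_i - f(x_i, a))^2
a_hat, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print("a1, a2, a3 =", a_hat)
```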

Solving the least squares problem
Least-squares optimization problems can be divided into linear and nonlinear problems. The linear problem has a closed-form solution. The optimization problem is said to be linear if setting the first-order partial derivatives of S with respect to the parameters $$\vec{a}$$ to zero yields a system of equations that is linear in the parameters. The general, nonlinear, unconstrained optimization problem has no closed-form solution. In this case iterative methods, such as Newton's method combined with gradient descent, or specialized methods for least-squares analysis, such as the Gauss-Newton algorithm or the Levenberg-Marquardt algorithm, can be used.
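As a sketch of the nonlinear case, the following fits an exponential model with SciPy's least_squares routine, here asked to use its Levenberg-Marquardt solver; the model, data, and starting point are invented for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

# Invented data roughly following y = a0 * exp(a1 * x)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 1.6, 2.8, 4.4, 7.4])

def residuals(a, x, y):
    # Residual vector r_i = y_i - f(x_i, a), with f nonlinear in a
    return y - a[0] * np.exp(a[1] * x)

# Levenberg-Marquardt from an arbitrary starting guess
result = least_squares(residuals, x0=[1.0, 1.0], args=(x, y), method="lm")
print(result.x)
```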

Least squares and regression analysis
In regression analysis, one replaces the relation


 * $$f(x_i)\approx y_i$$

by


 * $$y_i = f(x_i) + \varepsilon_i,$$

where the noise term $$\varepsilon_i$$ is a random variable with mean zero. Note that we are assuming that the $$x$$ values are exact, and all the errors are in the $$y$$ values. Again, we distinguish between linear regression, in which the function f is linear in the parameters to be determined (e.g., $$f(x) = ax^2 + bx + c$$), and nonlinear regression. As before, linear regression is much simpler than nonlinear regression. (It is tempting to think that the reason for the name linear regression is that the graph of the function $$f(x) = ax + b$$ is a line. But fitting a curve like $$f(x) = ax^2 + bx + c$$, estimating a, b, and c by least squares, is an instance of linear regression, because the vector of least-squares estimates of a, b, and c is a linear transformation of the vector of observations $$y_i = f(x_i) + \varepsilon_i$$.)
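To illustrate why such a quadratic fit is still linear regression (data invented for demonstration): the design matrix contains the nonlinear functions of x, while the unknown coefficients enter linearly, so an ordinary linear solve suffices.

```python
import numpy as np

# Invented data roughly following y = 2x^2 - 3x + 1 plus noise
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([15.1, 5.8, 1.2, 0.1, 2.9, 10.2])

# Columns x^2, x, 1: nonlinear in x, but the model is linear in (a, b, c)
A = np.column_stack([x**2, x, np.ones_like(x)])
a, b, c = np.linalg.lstsq(A, y, rcond=None)[0]
print(a, b, c)
```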

Parameter estimates
By recognizing that the regression model $$y_i = \alpha + \beta x_i + \varepsilon_i $$ is a system of linear equations, we can express the model using a data matrix X, a target vector Y, and a parameter vector $$\delta$$. The ith rows of X and Y contain the x and y values for the ith data sample. The model can then be written as


 * $$ \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}= \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots\\ 1 & x_n \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n \end{bmatrix} $$

which, in pure matrix notation, becomes
 * $$Y = X \delta + \varepsilon \,$$

where $$\varepsilon$$ is normally distributed with expected value 0 (i.e., a column vector of zeros) and variance $$\sigma^2 I_n$$, where $$I_n$$ is the $$n \times n$$ identity matrix.

The least-squares estimator for $$\delta$$ is


 * $$\widehat{\delta} = (X^T X)^{-1}\; X^T Y \,$$

(where $$X^T$$ is the transpose of X) and the sum of squares of residuals is


 * $$Y^T (I_n - X (X^T X)^{-1} X^T)\, Y.$$

One of the properties of least squares is that the vector $$X\widehat{\delta}$$ is the orthogonal projection of Y onto the column space of X.
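The estimator and the residual sum of squares above translate directly into code; the sketch below uses invented data and a two-column design matrix for the model $$y_i = \alpha + \beta x_i + \varepsilon_i$$:

```python
import numpy as np

# Invented data for y = alpha + beta * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]

# Least-squares estimator (X^T X)^{-1} X^T Y
delta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Sum of squares of residuals: Y^T (I_n - X (X^T X)^{-1} X^T) Y
H = X @ np.linalg.solve(X.T @ X, X.T)           # projection (hat) matrix
rss = Y @ (np.eye(len(Y)) - H) @ Y

print(delta_hat, rss)
```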

The fact that the matrix $$X(X^TX)^{-1}X^T$$ is a symmetric idempotent matrix is relied on repeatedly in proofs of theorems. The linearity of $$\widehat{\delta}$$ as a function of the vector Y, expressed above by saying


 * $$\widehat{\delta} = (X^TX)^{-1}X^TY,\,$$

is the reason why this is called "linear" regression. Nonlinear regression uses nonlinear methods of estimation.

The matrix $$I_n - X(X^TX)^{-1}X^T$$ that appears above is a symmetric idempotent matrix of rank $$n - 2$$. Here is an example of the use of that fact in the theory of linear regression. The finite-dimensional spectral theorem of linear algebra says that any real symmetric matrix M can be diagonalized by an orthogonal matrix G, i.e., the matrix $$G^T M G$$ is a diagonal matrix. If the matrix M is also idempotent, then the diagonal entries of $$G^T M G$$ must themselves be idempotent numbers. Only two real numbers are idempotent: 0 and 1. So $$I_n - X(X^TX)^{-1}X^T$$, after diagonalization, has $$n - 2$$ ones and two zeros on the diagonal. That is most of the work in showing that the sum of squares of residuals has a chi-square distribution with $$n-2$$ degrees of freedom.
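The eigenvalue claim can be checked numerically; the following sketch builds a two-parameter design matrix from invented x values and inspects the spectrum of $$I_n - X(X^TX)^{-1}X^T$$:

```python
import numpy as np

n = 6
x = np.linspace(0.0, 5.0, n)                         # invented x values
X = np.column_stack([np.ones(n), x])                 # two-column design matrix

M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)    # I_n - X (X^T X)^{-1} X^T
eigvals = np.linalg.eigvalsh(M)                      # real eigenvalues (M symmetric)
print(np.round(eigvals, 10))                         # n - 2 ones and two zeros
```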

Regression parameters can also be estimated by Bayesian methods. This has the advantages that


 * confidence intervals can be produced for parameter estimates without the use of asymptotic approximations,
 * prior information can be incorporated into the analysis.

Suppose that in the linear regression


 * $$ y = \alpha + \beta x + \varepsilon \, $$

we know from domain knowledge that alpha can only take one of the values {−1, +1} but we do not know which. We can build this information into the analysis by choosing a prior for alpha which is a discrete distribution with a probability of 0.5 on −1 and 0.5 on +1. The posterior for alpha will also be a discrete distribution on {−1, +1}, but the probability weights will change to reflect the evidence from the data.
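A minimal sketch of this calculation, assuming for illustration Gaussian noise with a known standard deviation and a known value of beta (all of these numbers are invented, not part of the text above):

```python
import numpy as np
from scipy.stats import norm

# Invented data and assumptions: known beta and noise standard deviation
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.2, 2.9, 4.1])
beta, sigma = 1.0, 0.5

# Prior: P(alpha = -1) = P(alpha = +1) = 0.5
alphas = np.array([-1.0, 1.0])
prior = np.array([0.5, 0.5])

# Likelihood of the observed y under each candidate alpha
lik = np.array([norm.pdf(y, loc=a + beta * x, scale=sigma).prod() for a in alphas])

# Posterior weights: prior times likelihood, renormalized
posterior = prior * lik
posterior /= posterior.sum()
print(dict(zip(alphas, posterior)))
```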

In modern computer applications, the actual value of $$\beta$$ is calculated using the QR decomposition, or somewhat more elaborate methods when $$X^TX$$ is nearly singular. The code for the MATLAB backslash operator, "\", is an excellent example of a robust method.
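A sketch of the QR route (data invented; NumPy returns the reduced factorization by default, and the triangular system $$R\delta = Q^T Y$$ is then solved by back substitution):

```python
import numpy as np
from scipy.linalg import solve_triangular

# Invented data for y = alpha + beta * x
x = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([1.1, 1.9, 3.2, 3.8])
X = np.column_stack([np.ones_like(x), x])

# Solve the least-squares problem via X = QR instead of forming X^T X
Q, R = np.linalg.qr(X)                      # reduced QR factorization
delta_hat = solve_triangular(R, Q.T @ Y)    # back substitution on R delta = Q^T Y
print(delta_hat)
```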

Summarizing the data
We sum the observations, the squares of the x values, and the products $$x_i y_i$$ to obtain the following quantities.


 * $$S_X = x_1 + x_2 + \cdots + x_n \,$$


 * $$S_Y = y_1 + y_2 + \cdots + y_n \,$$


 * $$S_{XX} = x_1^2 + x_2^2 + \cdots + x_n^2 \,$$


 * $$S_{XY} = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n. \,$$

Estimating beta (the slope)
We use the summary statistics above to calculate $$\widehat\beta$$, the estimate of $$\beta$$.


 * $$\widehat\beta = {n S_{XY} - S_X S_Y \over n S_{XX} - S_X S_X}. \,$$

Estimating alpha (the intercept)
We use the estimate of $$\beta$$ and the other statistics to estimate $$\alpha$$ by:


 * $$\widehat\alpha = {S_Y - \widehat\beta S_X \over n}. \,$$

A consequence of this estimate is that the regression line will always pass through the "center" $$(\bar{x},\bar{y}) = (S_X/n, S_Y/n)$$.
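The two formulas above map directly onto code; the following sketch computes the summary quantities and the estimates from invented data and confirms that the fitted line passes through $$(\bar{x},\bar{y})$$:

```python
import numpy as np

# Invented data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
n = len(x)

# Summary quantities
S_X, S_Y = x.sum(), y.sum()
S_XX, S_XY = (x**2).sum(), (x * y).sum()

# Slope and intercept estimates
beta_hat = (n * S_XY - S_X * S_Y) / (n * S_XX - S_X * S_X)
alpha_hat = (S_Y - beta_hat * S_X) / n

x_bar, y_bar = S_X / n, S_Y / n
print(beta_hat, alpha_hat)
print(np.isclose(alpha_hat + beta_hat * x_bar, y_bar))   # line passes through the center
```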

Limitations
Least squares estimation for linear models is notoriously non-robust to outliers. If the distribution of the outliers is skewed, the estimates can be biased. In the presence of any outliers, the least squares estimates are inefficient, and can be extremely so. When outliers occur in the data, methods of robust regression are more appropriate.
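A small demonstration of this sensitivity (data invented; a single grossly corrupted point visibly shifts the fitted intercept and slope):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y_clean = 2.0 * x + 1.0                      # exact linear data
y_outlier = y_clean.copy()
y_outlier[-1] += 30.0                        # one gross outlier

X = np.column_stack([np.ones_like(x), x])
for label, y in [("clean", y_clean), ("with outlier", y_outlier)]:
    alpha_hat, beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(label, alpha_hat, beta_hat)
```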