Linear least squares (LLS) is the least squares approximation of linear functions to data. It is a set of formulations for solving statistical problems involved in linear regression, including variants for ordinary (unweighted), weighted, and generalized (correlated) residuals. Numerical methods for linear least squares include inverting the matrix of the normal equations and orthogonal decomposition methods.
Main formulations
The three main linear least squares formulations are:
 Ordinary least squares (OLS) is the most common estimator. OLS estimates are commonly used to analyze both experimental and observational data.
The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated value of the unknown parameter vector β:
$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y,$$
where $y$ is a vector whose ith element is the ith observation of the dependent variable, and $X$ is a matrix whose ij element is the ith observation of the jth independent variable. The estimator is unbiased and consistent if the errors have finite variance and are uncorrelated with the regressors:^{[1]}
$$\operatorname{E}[x_{i}\varepsilon_{i}] = 0,$$
where $x_{i}$ is the transpose of row i of the matrix $X$.
 Weighted least squares (WLS) is used when heteroscedasticity is present in the error terms of the model.
 Generalized least squares (GLS) is an extension of the OLS method that allows efficient estimation of β when either heteroscedasticity, or correlations, or both are present among the error terms of the model, as long as the form of heteroscedasticity and correlation is known independently of the data. To handle heteroscedasticity when the error terms are uncorrelated with each other, GLS minimizes a weighted analogue to the sum of squared residuals from OLS regression, where the weight for the i^{th} case is inversely proportional to var(ε_{i}). This special case of GLS is called "weighted least squares". The GLS solution to the estimation problem is
$$\hat{\beta} = (X^{T}\Omega^{-1}X)^{-1}X^{T}\Omega^{-1}y,$$
where $\Omega$ is the covariance matrix of the errors.
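The weighted special case described above can be sketched for a straight-line model. This is a minimal illustration rather than a general implementation: the function name `weighted_line_fit` and the sample data are hypothetical, the errors are assumed to be uncorrelated (so the weight matrix is diagonal), and each weight is assumed to be the reciprocal of a known error variance.

```python
def weighted_line_fit(xs, ys, weights=None):
    """Fit y = b0 + b1*x by minimizing sum(w_i * (y_i - b0 - b1*x_i)**2)."""
    if weights is None:
        weights = [1.0] * len(xs)  # unit weights reduce WLS to OLS

    # Entries of the weighted normal equations (X^T W X) beta = X^T W y
    sw = sum(weights)
    swx = sum(w * x for w, x in zip(weights, xs))
    swxx = sum(w * x * x for w, x in zip(weights, xs))
    swy = sum(w * y for w, y in zip(weights, ys))
    swxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))

    # Solve the 2x2 system directly (Cramer's rule)
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("normal equations are singular")
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1


# Hypothetical data: the last two observations are assumed four times noisier
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.0, 5.0, 7.0, 10.0]
variances = [1.0, 1.0, 4.0, 4.0]
ols_fit = weighted_line_fit(xs, ys)
wls_fit = weighted_line_fit(xs, ys, [1.0 / v for v in variances])
```

Passing no weights gives the ordinary (unweighted) fit, so the two results can be compared to see how down-weighting the noisier observations moves the line.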
Alternative formulations
Other formulations include:
 Iteratively reweighted least squares (IRLS) is used when heteroscedasticity, or correlations, or both are present among the error terms of the model, but where little is known about the covariance structure of the errors independently of the data.^{[2]} In the first iteration, OLS, or GLS with a provisional covariance structure, is carried out, and the residuals are obtained from the fit. Based on the residuals, an improved estimate of the covariance structure of the errors can usually be obtained. A subsequent GLS iteration is then performed using this estimate of the error structure to define the weights. The process can be iterated to convergence, but in many cases, only one iteration is sufficient to achieve an efficient estimate of β.^{[3]}^{[4]}
 Instrumental variables regression (IV) can be performed when the regressors are correlated with the errors. In this case, we need the existence of some auxiliary instrumental variables z_{i} such that E[z_{i}ε_{i}] = 0. If Z is the matrix of instruments, then the estimator can be given in closed form as
$$\hat{\beta} = \left(X^{T}Z(Z^{T}Z)^{-1}Z^{T}X\right)^{-1}X^{T}Z(Z^{T}Z)^{-1}Z^{T}y.$$
 Total least squares (TLS)^{[5]} is an approach to least squares estimation of the linear regression model that treats the covariates and response variable in a more geometrically symmetric manner than OLS. It is one approach to handling the "errors in variables" problem, and is also sometimes used even when the covariates are assumed to be error-free.
In addition, percentage least squares focuses on reducing percentage errors, which is useful in the field of forecasting or time series analysis. It is also useful in situations where the dependent variable has a wide range without constant variance, as here the larger residuals at the upper end of the range would dominate if OLS were used. When the percentage or relative error is normally distributed, least squares percentage regression provides maximum likelihood estimates. Percentage regression is linked to a multiplicative error model, whereas OLS is linked to models containing an additive error term.^{[6]}
In constrained least squares, one is interested in solving a linear least squares problem with an additional constraint on the solution.
Objective function
In OLS (i.e., assuming unweighted observations), the optimal value of the objective function, found by substituting the optimal expression for the coefficient vector, can be written as
$$S = y^{T}(I - H)^{T}(I - H)y = y^{T}(I - H)y,$$
where $H = X(X^{T}X)^{-1}X^{T}$, the latter equality holding since $(I - H)$ is symmetric and idempotent. It can be shown from this^{[7]} that, for m observations and n parameters, under an appropriate assignment of weights the expected value of S is m − n. If instead unit weights are assumed, the expected value of S is $(m - n)\sigma^{2}$, where $\sigma^{2}$ is the variance of each observation.
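The symmetry and idempotence invoked above can be verified in a few lines. As a sketch, write $H = X(X^{T}X)^{-1}X^{T}$ for the projection matrix onto the column space of $X$ (the symbol $H$ and the invertibility of $X^{T}X$ are assumptions of this sketch):

$$
\begin{aligned}
H^{T} &= X\left[(X^{T}X)^{-1}\right]^{T}X^{T} = X(X^{T}X)^{-1}X^{T} = H,\\
H^{2} &= X(X^{T}X)^{-1}(X^{T}X)(X^{T}X)^{-1}X^{T} = H,\\
(I - H)^{T}(I - H) &= (I - H)(I - H) = I - 2H + H^{2} = I - H.
\end{aligned}
$$

The first line uses the symmetry of $X^{T}X$; the last line is what collapses $y^{T}(I - H)^{T}(I - H)y$ to $y^{T}(I - H)y$.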
If it is assumed that the residuals belong to a normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a chi-squared ($\chi^{2}$) distribution with m − n degrees of freedom. Some illustrative percentile values of $\chi^{2}$ are given in the following table.^{[8]}
These values can be used for a statistical criterion as to the goodness of fit. When unit weights are used, the numbers should be divided by the variance of an observation.
For WLS, the ordinary objective function above is replaced by a weighted sum of squared residuals.
Discussion
In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model. The resulting fitted model can be used to summarize the data, to predict unobserved values from the same system, and to understand the mechanisms that may underlie the system.
Mathematically, linear least squares is the problem of approximately solving an overdetermined system of linear equations, where the best approximation is defined as that which minimizes the sum of squared differences between the data values and their corresponding modeled values. The approach is called linear least squares since the assumed function is linear in the parameters to be estimated. Linear least squares problems are convex and have a closed-form solution that is unique, provided that the number of data points used for fitting equals or exceeds the number of unknown parameters, except in special degenerate situations. In contrast, nonlinear least squares problems generally must be solved by an iterative procedure, and the problems can be non-convex with multiple optima for the objective function. If prior distributions are available, then even an underdetermined system can be solved using the Bayesian MMSE estimator.
In statistics, linear least squares problems correspond to a particularly important type of statistical model called linear regression which arises as a particular form of regression analysis. One basic form of such a model is an ordinary least squares model. The present article concentrates on the mathematical aspects of linear least squares problems, with discussion of the formulation and interpretation of statistical regression models and statistical inferences related to these being dealt with in the articles just mentioned. See outline of regression analysis for an outline of the topic.
Properties
If the experimental errors, $\varepsilon$, are uncorrelated, have a mean of zero and a constant variance, $\sigma^{2}$, the Gauss–Markov theorem states that the least-squares estimator, $\hat{\beta}$, has the minimum variance of all unbiased estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, the distribution function of the errors need not be a normal distribution. However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.
For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss–Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.
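The claim about the arithmetic mean follows from a one-line minimization. As a sketch, let $y_{1}, \dots, y_{m}$ denote the measurements and $\mu$ the single unknown quantity (notation assumed here for illustration):

$$
S(\mu) = \sum_{i=1}^{m} (y_{i} - \mu)^{2}, \qquad
\frac{dS}{d\mu} = -2\sum_{i=1}^{m} (y_{i} - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} y_{i}.
$$

Since $d^{2}S/d\mu^{2} = 2m > 0$, this stationary point is the minimum.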
However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator.^{[9]}
These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.
Limitations
An assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares or more generally errors-in-variables models, or rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.^{[10]}^{[11]}
In some cases the (weighted) normal equations matrix X^{T}X is ill-conditioned. When fitting polynomials, the design matrix X is a Vandermonde matrix. Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases.^{[citation needed]} In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate.^{[citation needed]} Various regularization techniques can be applied in such cases, the most common of which is called ridge regression. If further information about the parameters is known, for example, a range of possible values of $\beta$, then various techniques can be used to increase the stability of the solution. For example, see constrained least squares.
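Ridge regression, mentioned above, replaces the normal equations with $(X^{T}X + \lambda I)\beta = X^{T}y$ for some $\lambda \ge 0$, which minimizes the residual sum of squares plus $\lambda$ times the squared norm of the coefficients. The following is a minimal sketch for two regressors without an intercept; the function name `ridge_fit_two`, the value of `lam`, and the nearly collinear sample data are all hypothetical.

```python
def ridge_fit_two(x1, x2, y, lam):
    """Minimize sum((y - b1*x1 - b2*x2)**2) + lam * (b1**2 + b2**2)."""
    # Entries of X^T X + lam * I and of X^T y for the two-column design matrix
    a11 = sum(u * u for u in x1) + lam
    a12 = sum(u * v for u, v in zip(x1, x2))
    a22 = sum(v * v for v in x2) + lam
    c1 = sum(u * t for u, t in zip(x1, y))
    c2 = sum(v * t for v, t in zip(x2, y))

    # Solve the regularized 2x2 system directly (Cramer's rule)
    det = a11 * a22 - a12 * a12
    if det == 0:
        raise ValueError("regularized normal equations are singular")
    b1 = (a22 * c1 - a12 * c2) / det
    b2 = (a11 * c2 - a12 * c1) / det
    return b1, b2


# Hypothetical, nearly collinear regressors: x2 is almost a copy of x1
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.001, 1.999, 3.001, 3.999]
y = [2.1, 3.9, 6.2, 7.8]
unregularized = ridge_fit_two(x1, x2, y, 0.0)
regularized = ridge_fit_two(x1, x2, y, 0.1)
```

With `lam = 0` the function reduces to ordinary least squares; a positive `lam` keeps the system well away from singularity at the cost of biasing the coefficients toward zero. Choosing `lam` is a separate problem not addressed in this sketch.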
Another drawback of the least squares estimator is the fact that the norm of the residuals, $\|y - X\hat{\beta}\|$, is minimized, whereas in some cases one is truly interested in obtaining small error in the parameter estimate $\hat{\beta}$, e.g., a small value of $\|\beta - \hat{\beta}\|$.^{[citation needed]} However, since the true parameter $\beta$ is necessarily unknown, this quantity cannot be directly minimized. If a prior probability on $\beta$ is known, then a Bayes estimator can be used to minimize the mean squared error, $\operatorname{E}\{\|\beta - \hat{\beta}\|^{2}\}$. The least squares method is often applied when no prior is known. Surprisingly, when several parameters are being estimated jointly, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James–Stein estimator. This is an example of more general shrinkage estimators that have been applied to regression problems.
Applications
 Polynomial fitting: models are polynomials in an independent variable, x:
 Straight line: $f(x, \beta) = \beta_{1} + \beta_{2}x$.^{[12]}
 Quadratic: $f(x, \beta) = \beta_{1} + \beta_{2}x + \beta_{3}x^{2}$.
 Cubic, quartic and higher polynomials. For regression with high-order polynomials, the use of orthogonal polynomials is recommended.^{[13]}
 Numerical smoothing and differentiation — this is an application of polynomial fitting.
 Multinomials in more than one independent variable, including surface fitting
 Curve fitting with B-splines^{[10]}
 Chemometrics, calibration curve, standard addition, Gran plot, analysis of mixtures
Uses in data fitting
The primary application of linear least squares is in data fitting. Given a set of m data points $y_{1}, y_{2}, \dots, y_{m}$, consisting of experimentally measured values taken at m values $x_{1}, x_{2}, \dots, x_{m}$ of an independent variable ($x_{i}$ may be scalar or vector quantities), and given a model function $y = f(x, \beta)$ with $\beta = (\beta_{1}, \beta_{2}, \dots, \beta_{n})$, it is desired to find the parameters $\beta_{j}$ such that the model function "best" fits the data. In linear least squares, linearity is meant to be with respect to the parameters $\beta_{j}$, so
$$f(x, \beta) = \sum_{j=1}^{n} \beta_{j}\varphi_{j}(x).$$
Here, the functions $\varphi_{j}$ may be nonlinear with respect to the variable x.
Ideally, the model function fits the data exactly, so
$$y_{i} = f(x_{i}, \beta)$$
for all $i = 1, 2, \dots, m$. This is usually not possible in practice, as there are more data points than there are parameters to be determined. The approach chosen then is to find the minimal possible value of the sum of squares of the residuals
$$r_{i}(\beta) = y_{i} - f(x_{i}, \beta), \quad i = 1, 2, \dots, m,$$
that is, to minimize the function
$$S(\beta) = \sum_{i=1}^{m} r_{i}^{2}(\beta).$$
After substituting for $r_{i}$ and then for $f$, this minimization problem becomes the quadratic problem of minimizing $\|y - X\beta\|^{2}$ with
$$X_{ij} = \varphi_{j}(x_{i}),$$
and the best fit can be found by solving the normal equations.
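The procedure just described (build the design matrix from the basis functions, form the normal equations, solve them) can be sketched as follows. This is an illustration under stated assumptions, not a production solver: the names `solve_linear_system` and `fit_basis` and the sample data are hypothetical, and the normal equations are solved by plain Gaussian elimination, which is adequate only for small, well-conditioned problems.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left unchanged
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[pivot][col] == 0.0:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x


def fit_basis(basis, xs, ys):
    """Least-squares coefficients for the model sum_j beta_j * basis[j](x)."""
    # Design matrix: X[i][j] = basis[j](xs[i])
    X = [[phi(x) for phi in basis] for x in xs]
    n = len(basis)
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(n)]
           for i in range(n)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    return solve_linear_system(xtx, xty)


# Hypothetical usage: a model that is nonlinear in x but linear in beta
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 2.0, 5.0, 10.0, 17.0]  # generated from 1 + x**2
beta = fit_basis(basis, xs, ys)   # recovers 1, 0, 1 up to rounding error
```

Forming $X^{T}X$ explicitly is the simplest route but also the least numerically stable; orthogonal decomposition methods avoid it.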
Example
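As a result of an experiment, four $(x, y)$ data points were obtained: $(1, 6)$, $(2, 5)$, $(3, 7)$, and $(4, 10)$ (shown in red in the diagram on the right). We hope to find a line $y = \beta_{1} + \beta_{2}x$ that best fits these four points. In other words, we would like to find the numbers $\beta_{1}$ and $\beta_{2}$ that approximately solve the overdetermined linear system
$$
\begin{aligned}
\beta_{1} + 1\beta_{2} &= 6\\
\beta_{1} + 2\beta_{2} &= 5\\
\beta_{1} + 3\beta_{2} &= 7\\
\beta_{1} + 4\beta_{2} &= 10
\end{aligned}
$$
of four equations in two unknowns in some "best" sense.
The residual, at each point, between the curve fit and the data is the difference between the right- and left-hand sides of the equations above. The least squares approach to solving this problem is to try to make the sum of the squares of these residuals as small as possible; that is, to find the minimum of the function
$$
\begin{aligned}
S(\beta_{1}, \beta_{2}) ={}& [6 - (\beta_{1} + 1\beta_{2})]^{2} + [5 - (\beta_{1} + 2\beta_{2})]^{2}\\
&+ [7 - (\beta_{1} + 3\beta_{2})]^{2} + [10 - (\beta_{1} + 4\beta_{2})]^{2}\\
={}& 4\beta_{1}^{2} + 30\beta_{2}^{2} + 20\beta_{1}\beta_{2} - 56\beta_{1} - 154\beta_{2} + 210.
\end{aligned}
$$
The minimum is determined by calculating the partial derivatives of $S(\beta_{1}, \beta_{2})$ with respect to $\beta_{1}$ and $\beta_{2}$ and setting them to zero:
$$\frac{\partial S}{\partial \beta_{1}} = 0 = 8\beta_{1} + 20\beta_{2} - 56,$$
$$\frac{\partial S}{\partial \beta_{2}} = 0 = 20\beta_{1} + 60\beta_{2} - 154.$$
This results in a system of two equations in two unknowns, called the normal equations, which when solved give
$$\beta_{1} = 3.5, \qquad \beta_{2} = 1.4,$$
and the equation $y = 3.5 + 1.4x$ of the line of best fit. The residuals, that is, the differences between the observed $y$ values and the values predicted by the line of best fit, are then found to be $1.1$, $-1.3$, $-0.7$, and $0.9$ (see the diagram on the right). The minimum value of the sum of squares of the residuals is
$$S(3.5, 1.4) = 1.1^{2} + (-1.3)^{2} + (-0.7)^{2} + 0.9^{2} = 4.2.$$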
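The arithmetic of a straight-line fit like this one is easy to check numerically. The sketch below assumes the four data points $(1, 6)$, $(2, 5)$, $(3, 7)$, $(4, 10)$ and uses the standard closed-form solution of the two normal equations (slope equal to the ratio of the centered cross-product sum to the centered sum of squares); the variable names are illustrative.

```python
# Data points assumed for this sketch
xs = [1.0, 2.0, 3.0, 4.0]
ys = [6.0, 5.0, 7.0, 10.0]

m = len(xs)
x_mean = sum(xs) / m
y_mean = sum(ys) / m

# Closed-form solution of the two normal equations for a straight line
sxx = sum((x - x_mean) ** 2 for x in xs)
sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
beta2 = sxy / sxx                 # slope
beta1 = y_mean - beta2 * x_mean   # intercept

# Residuals (observed minus fitted) and their sum of squares
residuals = [y - (beta1 + beta2 * x) for x, y in zip(xs, ys)]
S = sum(r * r for r in residuals)

print(beta1, beta2)  # close to 3.5 and 1.4
print(residuals)     # close to 1.1, -1.3, -0.7, 0.9
print(S)             # close to 4.2
```

The residuals sum to zero (up to rounding), as they must for any least-squares line that includes an intercept term.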
More generally, one can have $n$ regressors $x_{j}$, and a linear model
$$y = \beta_{0} + \sum_{j=1}^{n} \beta_{j}x_{j}.$$
Using a quadratic model
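Importantly, in "linear least squares", we are not restricted to using a line as the model as in the above example. For instance, we could have chosen the restricted quadratic model $y = \beta_{1}x^{2}$. This model is still linear in the $\beta_{1}$ parameter, so we can still perform the same analysis, constructing a system of equations from the data points:
$$
\begin{aligned}
6 &= \beta_{1}(1)^{2}\\
5 &= \beta_{1}(2)^{2}\\
7 &= \beta_{1}(3)^{2}\\
10 &= \beta_{1}(4)^{2}
\end{aligned}
$$
The partial derivatives with respect to the parameters (this time there is only one) are again computed and set to 0:
$$\frac{\partial S}{\partial \beta_{1}} = 0 = 708\beta_{1} - 498,$$
and solved
$$\beta_{1} = 0.703,$$
leading to the resulting best fit model $y = 0.703x^{2}$.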
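The coefficients of the single normal equation can be traced back to sums over the data. As a sketch, assuming the four data points $(x_{i}, y_{i}) = (1, 6), (2, 5), (3, 7), (4, 10)$:

$$
\begin{aligned}
S(\beta_{1}) &= \sum_{i=1}^{4} \left(y_{i} - \beta_{1}x_{i}^{2}\right)^{2},\\
\frac{\partial S}{\partial \beta_{1}} &= 2\beta_{1}\sum_{i=1}^{4} x_{i}^{4} - 2\sum_{i=1}^{4} x_{i}^{2}y_{i}
= 2(1 + 16 + 81 + 256)\,\beta_{1} - 2(6 + 20 + 63 + 160)
= 708\beta_{1} - 498,\\
\beta_{1} &= \frac{498}{708} \approx 0.703.
\end{aligned}
$$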
See also
 Line–line intersection § Nearest point to non-intersecting lines, an application
 Line fitting
 Nonlinear least squares
 Regularized least squares
 Simple linear regression
 Partial least squares regression
References
 ^ Lai, T.L.; Robbins, H.; Wei, C.Z. (1978). "Strong consistency of least squares estimates in multiple regression". PNAS. 75 (7): 3034–3036. Bibcode:1978PNAS...75.3034L. doi:10.1073/pnas.75.7.3034. JSTOR 68164. PMC 392707. PMID 16592540.
 ^ del Pino, Guido (1989). "The Unifying Role of Iterative Generalized Least Squares in Statistical Algorithms". Statistical Science. 4 (4): 394–403. doi:10.1214/ss/1177012408. JSTOR 2245853.
 ^ Carroll, Raymond J. (1982). "Adapting for Heteroscedasticity in Linear Models". The Annals of Statistics. 10 (4): 1224–1233. doi:10.1214/aos/1176345987. JSTOR 2240725.
 ^ Cohen, Michael; Dalal, Siddhartha R.; Tukey, John W. (1993). "Robust, Smoothly Heterogeneous Variance Regression". Journal of the Royal Statistical Society, Series C. 42 (2): 339–353. JSTOR 2986237.
 ^ Nievergelt, Yves (1994). "Total Least Squares: State-of-the-Art Regression in Numerical Analysis". SIAM Review. 36 (2): 258–264. doi:10.1137/1036055. JSTOR 2132463.
 ^ Tofallis, C (2009). "Least Squares Percentage Regression". Journal of Modern Applied Statistical Methods. 7: 526–534. doi:10.2139/ssrn.1406472. SSRN 1406472.
 ^ Hamilton, W. C. (1964). Statistics in Physical Science. New York: Ronald Press.
 ^ Spiegel, Murray R. (1975). Schaum's outline of theory and problems of probability and statistics. New York: McGraw-Hill. ISBN 9780585267395.
 ^ Margenau, Henry; Murphy, George Moseley (1956). The Mathematics of Physics and Chemistry. Princeton: Van Nostrand.
 ^ ^{a} ^{b} Gans, Peter (1992). Data fitting in the Chemical Sciences. New York: Wiley. ISBN 9780471934127.
 ^ Deming, W. E. (1943). Statistical adjustment of Data. New York: Wiley.
 ^ Acton, F. S. (1959). Analysis of Straight-Line Data. New York: Wiley.
 ^ Guest, P. G. (1961). Numerical Methods of Curve Fitting. Cambridge: Cambridge University Press.^{[page needed]}
Further reading
 Bevington, Philip R.; Robinson, Keith D. (2003). Data Reduction and Error Analysis for the Physical Sciences. McGraw-Hill. ISBN 9780072472271.