Part of a series on Statistics 
Regression analysis 

Models 
Estimation 
Background 
In applied statistics, total least squares is a type of errorsinvariables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and nonlinear models.
The total least squares approximation of the data is generically equivalent to the best, in the Frobenius norm, lowrank approximation of the data matrix.^{[1]}
YouTube Encyclopedic

1/5Views:112 942495236 6169 91019 859

✪ Least squares  MIT 18.02SC Multivariable Calculus, Fall 2010

✪ Lecture31 (Data2Decision) Total Regression, part 2

✪ Least squares approximation  Linear Algebra  Khan Academy

✪ Chapter 16 Problems, Singular Value Decomposition, SVD & Least Squares

✪ The Least Squares Formula: A Derivation
Transcription
CHRISTINE BREINER: Welcome back to recitation. In this video, what I'd like you to do is use least squares to fit a line to the following data, which includes three points: the point 0, 1, the point 2, 1, and the point 3, 4. And I've drawn a rough picture where these points are on a graph, and I'll be talking a little bit about that after you try this problem. So why don't you try to solve the problem based on the technique you learned in class, and then bring the video back up. I'll show you again what you're actually doing, and then I will also solve the problem and you can see, you can compare your answer to mine. OK, welcome back. So the first thing I want to do when we're talking about least squares to fit a line to these data, what I'd like to do is I'm going to draw a line on here, and we're going to talk about what the least squares actually is trying to do. And so I'm going to draw this line, say, something like this. That's my first guess on what might be the actual least squares line for these data. And if you recall, what you're actually trying to do is you're trying to minimize a certain quantity, and the quantity you're trying to minimize is the difference between the actual value you get and the expected value you get, the square of that. So this is one thing we're going to square. So even pictorially, we could say we have something that contributes some area, OK, and then this is another thing that's the difference between the expected value and y and the actual value you get. So let me try and make that squared, because we square that value. And then this last one is pretty small. The difference between the expected value and the actual value is very small over in this case from the picture I drew. And so what you're trying to do is you're trying to find the line that minimizes the area of the sum of these three, in this case, three squares. So if I, you know, if I tip the line down further here, but I kept it fixed there, the area of this square would get bigger, because this distance would get bigger, but, of course, the area of this square would get smaller. And so you're trying to figure out a way to optimize, in this case, minimize, the area of these sorry, the sum of the areas of these squares by figuring out where to put the line. Now, I mentioned if I could fix this here and move it up and down, then that would change the slope but fix the intercept. But I could also move the intercept up and down, and I could slide the line up and down, and that would fix the slope but vary the intercept, or I could do both. And ultimately, that's what you're doing. You're trying to find the best yintercept and the best slope at the same time, and that's why you learned this process in multivariable calculus, because you're trying to optimize something that has two quantities that you can change. You can change slope and you can change the yintercept, so that's why you learned it now instead of in singlevariable calculus. Now, if we come over here, I wrote down ahead of time what the equations are you get when you optimize, when you're trying to optimize the least squares line. So if you write down, you know, the function as a function of a and b, and you take the derivative with respect to a and you set it equal to zero, you get the first equation. When you take the derivative with respect to b and you set it equal to zero, you get the second equation. And so I'm just going to fill in the details and then tell you what the solution is, and then I'm going to mention a little more sophisticated thing we can do, or I guess maybe the next level of least squares we can do, that's not a linear problem. So I'll do that last. So what I want to do is I put all the values I need over here on the right. So all the values we need are on the right. We're going to need the sum of all the xi's and that's 5. The sum of all the yi's is 6. The sum of the xi squareds is 13, and the sum of the xi yi product is 14. Now, where are these coming from? Remember, what is xi and what is yi? If we come back over here, this is x1, y1, x2, y2, x3, y3, OK? So that's sort of how we get if we switch the whole pairs around, it doesn't matter. Obviously, if we switched x and yvalues together, it does matter, and that you see actually from the equation does matter. Right here you have a product of them. So let me write this out. And again, we're solving for a and b, so in this case, a is the slope and b is the intercept, right? So let me write everything in. So we actually get that we want to solve the following system: 13a plus 5b is equal to 14. Make sure I put in everything correctly. And then the second one is 5a plus n in this case is 3, because there were three points plus 3b is equal to 6. OK, this is a system of equations. It has two equations and two unknowns, and if you solve this I should look at my notes, because I didn't memorize it and I'm not going to do all the algebra. If you go to solve this, what you actually get is that a is equal to 6/7 and b is equal to 4/7. So you get that the line let me write it here. You get a is equal to 6/7 and b is equal to 4/7, OK? And so what that tells you is if you come back over here, the line we actually want I'll write the solution right here the line we actually want is the line y equals 6/7x plus 4/7. That is our least squares line that is our solution. OK? So that is actually solving for a line to fit these data, you get the following solution. But I want to point out, and what we're going to do, after this video there will be a challenge problem to have you actually finish this. I want to point out that you can actually also fit a higherdegree polynomial to these data. For instance, you could try and use the technique of least squares to fit a parabola to these data. And let me point out what the function would look like in this case. So this is for the challenge problem. For the challenge problem, it now will be a function of three variables, so it will look something like this. We'll say a comma b comma c, and the point that we'll be wanting to minimize, or the function we'll be wanting to minimize, we're summing over i, again from 1 to 3 I didn't write that down above, but it was fairly obvious that that was what we were summing over the following thing. We have yi minus this big quantity: axi squared plus bxi plus c and we square the whole thing. And so what this is actually giving you is, just as in the picture before, this is giving you the difference between the expected value and the actual value. So this is your actual yvalue, and based on the parabola you're looking for, what's in parentheses is your expected yvalue, right? So this would actually be my parabola. When I evaluate it at xi, I get a number. I get a value, and that value will be on the parabola. This will be the yvalue associated to the xvalue from the data. So this is the actual value. This is the expected value. We take the difference and square it. And because we want to minimize how we're going to solve this problem and this is what you'll have to do is, just like with the line situation, you're going to take the derivative of this function with respect to each of these three variables separately. So you'll take f sub a, and then you'll set that equal to zero. And then you'll take f sub b, you'll set that equal to zero. And then you'll take f sub c, and you'll set that equal to zero. Again, we set them all equal to zero, because this is really an optimization problem. We want to minimize something. We want to minimize this quantity. So you take each of those three derivatives, partial derivatives, set them equal to zero, and you have a system of three equations with three variables. And you know how to solve such a system, by using matrices, for instance. That would be one way to solve such a system. And so then you can actually come up with a formula for the sort of parabola of best fit, you could think of it as. And so I'm going to stop there, because I think from here, you guys will do it. But I just want to point out this is where you're going to be starting this next sort of challenge problem to find the quadratic of best fit for these three data points. So that's where I'll end it.
Contents
Linear model
Background
In the least squares method of data modeling, the objective function, S,
is minimized, where r is the vector of residuals and W is a weighting matrix. In linear least squares the model contains equations which are linear in the parameters appearing in the parameter vector , so the residuals are given by
There are m observations in y and n parameters in β with m>n. X is a m×n matrix whose elements are either constants or functions of the independent variables, x. The weight matrix W is, ideally, the inverse of the variancecovariance matrix of the observations y. The independent variables are assumed to be errorfree. The parameter estimates are found by setting the gradient equations to zero, which results in the normal equations ^{[note 1]}
Allowing observation errors in all variables
Now, suppose that both x and y are observed subject to error, with variancecovariance matrices and respectively. In this case the objective function can be written as
where and are the residuals in x and y respectively. Clearly these residuals cannot be independent of each other, but they must be constrained by some kind of relationship. Writing the model function as , the constraints are expressed by m condition equations.^{[2]}
Thus, the problem is to minimize the objective function subject to the m constraints. It is solved by the use of Lagrange multipliers. After some algebraic manipulations,^{[3]} the result is obtained.
or alternatively where M is the variancecovariance matrix relative to both independent and dependent variables.
Example
When the data errors are uncorrelated, all matrices M and W are diagonal. Then, take the example of straight line fitting.
in this case
showing how the variance at the ith point is determined by the variances of both independent and dependent variables and by the model being used to fit the data. The expression may be generalized by noting that the parameter is the slope of the line.
An expression of this type is used in fitting pH titration data where a small error on x translates to a large error on y when the slope is large.
Algebraic point of view
First of all it is necessary to note that the TLS problem does not have a solution in general, which was already shown in 1980.^{[4]} The following considers the simple case where a unique solution exists without making any particular assumptions.
The computation of the TLS using singular value decomposition is described in standard texts.^{[5]} We can solve the equation
for B where X is mbyn and Y is mbyk. ^{[note 2]}
That is, we seek to find B that minimizes error matrices E and F for X and Y respectively. That is,
where is the augmented matrix with E and F side by side and is the Frobenius norm, the square root of the sum of the squares of all entries in a matrix and so equivalently the square root of the sum of squares of the lengths of the rows or columns of the matrix.
This can be rewritten as
where is the identity matrix. The goal is then to find that reduces the rank of by k. Define to be the singular value decomposition of the augmented matrix .
where V is partitioned into blocks corresponding to the shape of X and Y.
Using the Eckart–Young theorem, the approximation minimising the norm of the error is such that matrices and are unchanged, while the smallest singular values are replaced with zeroes. That is, we want
so by linearity,
We can then remove blocks from the U and Σ matrices, simplifying to
This provides E and F so that
Now if is nonsingular, which is not always the case (note that the behavior of TLS when is singular is not well understood yet), we can then right multiply both sides by to bring the bottom block of the right matrix to the negative identity, giving^{[6]}
and so
A naive GNU Octave implementation of this is:
function B = tls(X,Y)
[m n] = size(X); % n is the width of X (X is m by n)
Z = [X Y]; % Z is X augmented with Y.
[U S V] = svd(Z,0); % find the SVD of Z.
VXY = V(1:n,1+n:end); % Take the block of V consisting of the first n rows and the n+1 to last column
VYY = V(1+n:end,1+n:end); % Take the bottomright block of V.
B = VXY/VYY;
end
The way described above of solving the problem, which requires that the matrix is nonsingular, can be slightly extended by the socalled classical TLS algorithm.^{[7]}
Computation
The standard implementation of classical TLS algorithm is available through Netlib, see also.^{[8]}^{[9]} All modern implementations based, for example, on solving a sequence of ordinary least squares problems, approximate the matrix (denoted in the literature), as introduced by Van Huffel and Vandewalle. It is worth noting, that this is, however, not the TLS solution in many cases.^{[10]}^{[11]}
Nonlinear model
For nonlinear systems similar reasoning shows that the normal equations for an iteration cycle can be written as
Geometrical interpretation
When the independent variable is errorfree a residual represents the "vertical" distance between the observed data point and the fitted curve (or surface). In total least squares a residual represents the distance between a data point and the fitted curve measured along some direction. In fact, if both variables are measured in the same units and the errors on both variables are the same, then the residual represents the shortest distance between the data point and the fitted curve, that is, the residual vector is perpendicular to the tangent of the curve. For this reason, this type of regression is sometimes called two dimensional Euclidean regression (Stein, 1983)^{[12]} or orthogonal regression.
Scale invariant methods
A serious difficulty arises if the variables are not measured in the same units. First consider measuring distance between a data point and the line, as in the diagram – what are the measurement units for this distance? If we consider measuring distance based on Pythagoras' Theorem then it is clear that we shall be adding quantities measured in different units, which is meaningless. Secondly, if we rescale one of the variables e.g., measure in grams rather than kilograms, then we shall end up with different results (a different line). To avoid these problems it is sometimes suggested that we convert to dimensionless variables—this may be called normalization or standardization. However there are various ways of doing this, and these lead to fitted models which are not equivalent to each other. One approach is to normalize by known (or estimated) measurement precision thereby minimizing the Mahalanobis distance from the points to the line, providing a maximumlikelihood solution;^{[citation needed]} the unknown precisions could be found via analysis of variance.
In short, total least squares does not have the property of unitsinvariance—i.e. it is not scale invariant. For a meaningful model we require this property to hold. A way forward is to realise that residuals (distances) measured in different units can be combined if multiplication is used instead of addition. Consider fitting a line: for each data point the product of the vertical and horizontal residuals equals twice the area of the triangle formed by the residual lines and the fitted line. We choose the line which minimizes the sum of these areas. Nobel laureate Paul Samuelson proved in 1942 that, in two dimensions, it is the only line expressible solely in terms of the ratios of standard deviations and the correlation coefficient which (1) fits the correct equation when the observations fall on a straight line, (2) exhibits scale invariance, and (3) exhibits invariance under interchange of variables.^{[13]} This solution has been rediscovered in different disciplines and is variously known as standardised major axis (Ricker 1975, Warton et al., 2006),^{[14]}^{[15]} the reduced major axis, the geometric mean functional relationship (Draper and Smith, 1998),^{[16]} least products regression, diagonal regression, line of organic correlation, and the least areas line (Tofallis, 2002).^{[17]} Tofallis (2015)^{[18]} has extended this approach to deal with multiple variables.
See also
 Deming regression, a special case with two predictors and independent errors.
 Errorsinvariables model
 Linear regression
 Least squares
Notes
 ^ An alternative form is , where is the parameter shift from some starting estimate of and is the difference between y and the value calculated using the starting value of
 ^ The notation XB ≈ Y is used here to reflect the notation used in the earlier part of the article. In the computational literature the problem has been more commonly presented as AX ≈ B, i.e. with the letter X used for the nbyk matrix of unknown regression coefficients.
References
 ^ I. Markovsky and S. Van Huffel, Overview of total least squares methods. Signal Processing, vol. 87, pp. 2283–2302, 2007. preprint
 ^ W.E. Deming, Statistical Adjustment of Data, Wiley, 1943
 ^ Gans, Peter (1992). Data Fitting in the Chemical Sciences. Wiley. ISBN 9780471934127. Retrieved 4 December 2012.
 ^ G. H. Golub and C. F. Van Loan, An analysis of the total least squares problem. Numer. Anal., 17, 1980, pp. 883–893.
 ^ Golub, Gene H.; Van Loan, Charles F. (1996). Matrix Computations (3rd ed.). The Johns Hopkins University Press. pp 596.
 ^ Bjõrck, Ake (1996) Numerical Methods for Least Squares Problems, Society for Industrial and Applied Mathematics. ISBN 9780898713602^{[page needed]}
 ^ S. Van Huffel and J. Vandewalle (1991) The Total Least Squares Problems: Computational Aspects and Analysis. SIAM Publications, Philadelphia PA.
 ^ S. Van Huffel, Documented Fortran 77 programs of the extended classical total least squares algorithm, the partial singular value decomposition algorithm and the partial total least squares algorithm, Internal Report ESATKUL 88/1, ESAT Lab., Dept. of Electrical Engineering, Katholieke Universiteit Leuven, 1988.
 ^ S. Van Huffel, The extended classical total least squares algorithm, J. Comput. Appl. Math., 25, pp. 111–119, 1989.
 ^ M. Plešinger, The Total Least Squares Problem and Reduction of Data in AX ≈ B. Doctoral Thesis, TU of Liberec and Institute of Computer Science, AS CR Prague, 2008. Ph.D. Thesis
 ^ I. Hnětynková, M. Plešinger, D. M. Sima, Z. Strakoš, and S. Van Huffel, The total least squares problem in AX ≈ B. A new classification with the relationship to the classical works. SIMAX vol. 32 issue 3 (2011), pp. 748–770.
 ^ Stein, Yaakov J. "Two Dimensional Euclidean Regression" (PDF).
 ^ Samuelson, Paul A. (1942). "A Note on Alternative Regressions". Econometrica. 10 (1): 80–83. doi:10.2307/1907024. JSTOR 1907024.
 ^ Ricker, W. E. (1975). "A note concerning Professor Jolicoeur's Comments". Journal of the Fisheries Research Board of Canada. 32 (8): 1494–1498. doi:10.1139/f75172.
 ^ Warton, David I.; Wright, Ian J.; Falster, Daniel S.; Westoby, Mark (2006). "Bivariate linefitting methods for allometry". Biological Reviews. 81 (2): 259–291. CiteSeerX 10.1.1.461.9154. doi:10.1017/S1464793106007007.
 ^ Draper, NR and Smith, H. Applied Regression Analysis, 3rd edition, pp. 92–96. 1998
 ^ Tofallis, Chris (2002). "Model Fitting for Multiple Variables by Minimising the Geometric Mean Deviation". In Van Huffel, Sabine; Lemmerling, P. (eds.). Total Least Squares and ErrorsinVariables Modeling: Analysis, Algorithms and Applications. Dordrecht: Kluwer Academic Publ. ISBN 9781402004766. SSRN 1077322.
 ^ Tofallis, Chris (2015). "Fitting Equations to Data with the Perfect Correlation Relationship". SSRN 2707593.
Others
 I. Hnětynková, M. Plešinger, D. M. Sima, Z. Strakoš, and S. Van Huffel, The total least squares problem in AX ≈ B. A new classification with the relationship to the classical works. SIMAX vol. 32 issue 3 (2011), pp. 748–770. Available as a preprint.
 M. Plešinger, The Total Least Squares Problem and Reduction of Data in AX ≈ B. Doctoral Thesis, TU of Liberec and Institute of Computer Science, AS CR Prague, 2008. Ph.D. Thesis
 C. C. Paige, Z. Strakoš, Core problems in linear algebraic systems. SIAM J. Matrix Anal. Appl. 27, 2006, pp. 861–875. doi:10.1137/040616991
 S. Van Huffel and P. Lemmerling, Total Least Squares and ErrorsinVariables Modeling: Analysis, Algorithms and Applications. Dordrecht, The Netherlands: Kluwer Academic Publishers, 2002.
 S. Jo and S. W. Kim, Consistent normalized least mean square filtering with noisy data matrix. IEEE Trans. Signal Process., vol. 53, no. 6, pp. 2112–2123, Jun. 2005.
 R. D. DeGroat and E. M. Dowling, The data least squares problem and channel equalization. IEEE Trans. Signal Process., vol. 41, no. 1, pp. 407–411, Jan. 1993.
 S. Van Huffel and J. Vandewalle, The Total Least Squares Problems: Computational Aspects and Analysis. SIAM Publications, Philadelphia PA, 1991. doi:10.1137/1.9781611971002
 T. Abatzoglou and J. Mendel, Constrained total least squares, in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP’87), Apr. 1987, vol. 12, pp. 1485–1488.
 P. de Groen An introduction to total least squares, in Nieuw Archief voor Wiskunde, Vierde serie, deel 14, 1996, pp. 237–253 arxiv.org.
 G. H. Golub and C. F. Van Loan, An analysis of the total least squares problem. SIAM J. on Numer. Anal., 17, 1980, pp. 883–893. doi:10.1137/0717073
 Perpendicular Regression Of A Line at MathPages
 A. R. AmiriSimkooei and S. Jazaeri Weighted total least squares formulated by standard least squares theory,in Journal of Geodetic Science, 2 (2): 113–124, 2012 [1].