Lack-of-fit sum of squares

In statistics, a sum of squares due to lack of fit, or more tersely a lack-of-fit sum of squares, is one of the two components of a partition of the residual sum of squares in an analysis of variance. It is used in the numerator of an F-test of the null hypothesis that a proposed model fits well. The other component is the pure-error sum of squares.

The pure-error sum of squares is the sum of squared deviations of each value of the dependent variable from the average value over all observations sharing its independent variable value(s). These are errors that could never be avoided by any predictive equation that assigns a predicted value for the dependent variable as a function of the value(s) of the independent variable(s). The remainder of the residual sum of squares is attributed to lack of fit of the model, since a different predictive equation could in principle eliminate that part of the error entirely.


Principle

In order for the lack-of-fit sum of squares to differ from the sum of squares of residuals, there must be more than one value of the response variable for at least one of the values of the set of predictor variables. For example, consider fitting a line

y=\alpha x+\beta \,

by the method of least squares. One takes as estimates of α and β the values that minimize the sum of squares of residuals, i.e., the sum of squares of the differences between the observed y-values and the fitted y-values. To have a lack-of-fit sum of squares that differs from the residual sum of squares, one must observe more than one y-value for at least one of the x-values. One then partitions the "sum of squares due to error", i.e., the sum of squares of residuals, into two components:

sum of squares due to error = (sum of squares due to "pure" error) + (sum of squares due to lack of fit).

The sum of squares due to "pure" error is the sum of squares of the differences between each observed y-value and the average of all y-values corresponding to the same x-value.

The sum of squares due to lack of fit is the weighted sum of squares of differences between each average of y-values corresponding to the same x-value and the corresponding fitted y-value, the weight in each case being simply the number of observed y-values for that x-value.^[1]^[2] Because it is a property of least squares regression that the vector whose components are "pure errors" and the vector of lack-of-fit components are orthogonal to each other, the following equality holds:

{\begin{aligned}&\sum ({\text{observed value}}-{\text{fitted value}})^{2}&&{\text{(error)}}\\&\qquad =\sum ({\text{observed value}}-{\text{local average}})^{2}&&{\text{(pure error)}}\\&\qquad \qquad {}+\sum {\text{weight}}\times ({\text{local average}}-{\text{fitted value}})^{2}&&{\text{(lack of fit)}}\end{aligned}}

Hence the residual sum of squares has been completely decomposed into two components.
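The decomposition can be checked numerically. The following is a minimal sketch in plain Python, assuming a small made-up data set with repeated x-values; the data and the helper names `fit_line` and `partition` are illustrative and do not come from the cited references.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for y = alpha * x + beta."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    beta = y_bar - alpha * x_bar
    return alpha, beta

def partition(xs, ys):
    """Return (error, pure error, lack of fit) sums of squares."""
    alpha, beta = fit_line(xs, ys)
    # Group the observed y-values by their x-value.
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    local_avg = {x: sum(v) / len(v) for x, v in groups.items()}
    ss_error = sum((y - (alpha * x + beta)) ** 2 for x, y in zip(xs, ys))
    ss_pure = sum((y - local_avg[x]) ** 2 for x, y in zip(xs, ys))
    # Each squared difference is weighted by the number of y-values at that x.
    ss_lof = sum(len(v) * (local_avg[x] - (alpha * x + beta)) ** 2
                 for x, v in groups.items())
    return ss_error, ss_pure, ss_lof

# Made-up data: four distinct x-values, each observed two or three times.
xs = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
ys = [2.0, 2.4, 3.9, 4.3, 4.1, 5.8, 6.4, 9.0, 9.6, 9.3]

ss_error, ss_pure, ss_lof = partition(xs, ys)
print("error:", ss_error)
print("pure error + lack of fit:", ss_pure + ss_lof)
```

Up to floating-point rounding, the two printed values agree, which is the equality stated above.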

Mathematical details

Consider fitting a line with one predictor variable. Define i as an index of each of the n distinct x values, j as an index of the response variable observations for a given x value, and n_i as the number of y values associated with the i-th x value. The value of each response variable observation can be represented by

Y_{ij}=\alpha x_{i}+\beta +\varepsilon _{ij},\qquad i=1,\dots ,n,\quad j=1,\dots ,n_{i}.

Let

{\widehat {\alpha }},{\widehat {\beta }}\,

be the least squares estimates of the unobservable parameters α and β based on the observed values of x_i and Y_ij.

Let

{\widehat {Y}}_{i}={\widehat {\alpha }}x_{i}+{\widehat {\beta }}\,

be the fitted values of the response variable. Then

{\widehat {\varepsilon }}_{ij}=Y_{ij}-{\widehat {Y}}_{i}\,

are the residuals, which are observable estimates of the unobservable values of the error term ε_ij. Because of the nature of the method of least squares, the whole vector of residuals, with

N=\sum _{i=1}^{n}n_{i}

scalar components, necessarily satisfies the two constraints

\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}=0\,

\sum _{i=1}^{n}\left(x_{i}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}\right)=0.\,

It is thus constrained to lie in an (N − 2)-dimensional subspace of R^N, i.e. there are N − 2 "degrees of freedom for error".
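The two constraints can be verified directly on a fitted line. This is a small sketch in plain Python using made-up data (the numbers are an assumption chosen for illustration); it computes the least squares estimates from the usual closed-form expressions and then evaluates both constraint sums.

```python
# Made-up data with replicated x-values (illustrative only).
xs = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
ys = [2.0, 2.4, 3.9, 4.3, 4.1, 5.8, 6.4, 9.0, 9.6, 9.3]

n_obs = len(xs)                      # N, the total number of observations
x_bar = sum(xs) / n_obs
y_bar = sum(ys) / n_obs

# Closed-form least squares estimates for y = alpha * x + beta.
alpha_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
beta_hat = y_bar - alpha_hat * x_bar

residuals = [y - (alpha_hat * x + beta_hat) for x, y in zip(xs, ys)]

# First constraint: the residuals sum to zero.
sum_resid = sum(residuals)
# Second constraint: the residuals are orthogonal to the x-values.
sum_x_resid = sum(x * e for x, e in zip(xs, residuals))

# Two linear constraints leave N - 2 degrees of freedom for error.
df_error = n_obs - 2

print(sum_resid, sum_x_resid, df_error)
```

Both sums come out as zero up to floating-point rounding, so the residual vector is confined to a subspace of dimension N − 2.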

Now let

{\overline {Y}}_{i\bullet }={\frac {1}{n_{i}}}\sum _{j=1}^{n_{i}}Y_{ij}

be the average of all Y-values associated with the i-th x-value.

We partition the sum of squares due to error into two components:

{\begin{aligned}&\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}^{\,2}=\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\widehat {Y}}_{i}\right)^{2}\\&=\underbrace {\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\overline {Y}}_{i\bullet }\right)^{2}} _{\text{(sum of squares due to pure error)}}+\underbrace {\sum _{i=1}^{n}n_{i}\left({\overline {Y}}_{i\bullet }-{\widehat {Y}}_{i}\right)^{2}.} _{\text{(sum of squares due to lack of fit)}}\end{aligned}}
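A short derivation, using only the definitions above, shows why no cross term appears in this partition. Writing each residual as the sum of a pure-error part and a lack-of-fit part, and then squaring and summing, produces the two sums above plus twice the following cross term:

```latex
\sum_{i=1}^{n}\sum_{j=1}^{n_{i}}
  \left(Y_{ij}-\overline{Y}_{i\bullet}\right)
  \left(\overline{Y}_{i\bullet}-\widehat{Y}_{i}\right)
=\sum_{i=1}^{n}
  \left(\overline{Y}_{i\bullet}-\widehat{Y}_{i}\right)
  \sum_{j=1}^{n_{i}}\left(Y_{ij}-\overline{Y}_{i\bullet}\right)
=0.
```

The factor involving the fitted value does not depend on j, so it can be taken outside the inner sum, and the inner sum of deviations from the group average is zero for every i. For the same reason, summing the squared lack-of-fit part over j simply multiplies it by n_i, which is where the weight in the second component comes from.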

Probability distributions

Sums of squares

Suppose the error terms ε_ij are independent and normally distributed with expected value 0 and variance σ². We treat x_i as constant rather than random. Then the response variables Y_ij are random only because the errors ε_ij are random.

It can be shown to follow that if the straight-line model is correct, then the sum of squares due to error divided by the error variance,

{\frac {1}{\sigma ^{2}}}\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}{\widehat {\varepsilon }}_{ij}^{\,2}

has a chi-squared distribution with N − 2 degrees of freedom.

Moreover, given the total number of observations N, the number of levels of the independent variable n, and the number of parameters in the model p:

The sum of squares due to pure error, divided by the error variance σ², has a chi-squared distribution with N − n degrees of freedom;
The sum of squares due to lack of fit, divided by the error variance σ², has a chi-squared distribution with n − p degrees of freedom (here p = 2 as there are two parameters in the straight-line model);
The two sums of squares are probabilistically independent.

The test statistic

It then follows that the statistic

{\begin{aligned}F&={\frac {{\text{lack-of-fit sum of squares}}/{\text{degrees of freedom}}}{{\text{pure-error sum of squares}}/{\text{degrees of freedom}}}}\\[8pt]&={\frac {\left.\sum _{i=1}^{n}n_{i}\left({\overline {Y}}_{i\bullet }-{\widehat {Y}}_{i}\right)^{2}\right/(n-p)}{\left.\sum _{i=1}^{n}\sum _{j=1}^{n_{i}}\left(Y_{ij}-{\overline {Y}}_{i\bullet }\right)^{2}\right/(N-n)}}\end{aligned}}

has an F-distribution with the corresponding number of degrees of freedom in the numerator and the denominator, provided that the model is correct. If the model is wrong, then the probability distribution of the denominator is still as stated above, and the numerator and denominator are still independent. But the numerator then has a noncentral chi-squared distribution, and consequently the quotient as a whole has a non-central F-distribution.

One uses this F-statistic to test the null hypothesis that the linear model is correct. Since the non-central F-distribution is stochastically larger than the (central) F-distribution, one rejects the null hypothesis if the F-statistic is larger than the critical F value. The critical value is the quantile of the F-distribution with degrees of freedom d₁ = (n − p) and d₂ = (N − n) at the desired confidence level, i.e. the value at which the cumulative distribution function equals that level.
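The test can be sketched end to end in plain Python. The data below are made up, and the function names `lack_of_fit_f` and `f_quantile_d1_equals_2` are illustrative. One simplifying assumption is labelled loudly: the example has n = 4 levels and p = 2 parameters, so the numerator has exactly 2 degrees of freedom, and in that special case the F-distribution's cumulative distribution function has the closed form 1 − (1 + 2x/d₂)^(−d₂/2), which can be inverted by hand. For other degrees of freedom one would normally take the quantile from a statistics library instead.

```python
from collections import defaultdict

def lack_of_fit_f(xs, ys, p=2):
    """F statistic and degrees of freedom for a straight-line fit."""
    n_obs = len(xs)
    x_bar = sum(xs) / n_obs
    y_bar = sum(ys) / n_obs
    alpha = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    beta = y_bar - alpha * x_bar

    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n_levels = len(groups)           # n, the number of distinct x-values

    ss_pure = 0.0
    ss_lof = 0.0
    for x, values in groups.items():
        avg = sum(values) / len(values)
        ss_pure += sum((y - avg) ** 2 for y in values)
        ss_lof += len(values) * (avg - (alpha * x + beta)) ** 2

    df_lof = n_levels - p            # numerator degrees of freedom
    df_pure = n_obs - n_levels       # denominator degrees of freedom
    f_stat = (ss_lof / df_lof) / (ss_pure / df_pure)
    return f_stat, df_lof, df_pure

def f_quantile_d1_equals_2(level, d2):
    """Quantile of F(2, d2); ONLY valid for 2 numerator degrees of freedom.

    Inverts the closed-form CDF 1 - (1 + 2x/d2) ** (-d2/2).
    """
    return (d2 / 2) * ((1 - level) ** (-2 / d2) - 1)

# Made-up data: 4 distinct x-values, 10 observations in total.
xs = [1, 1, 2, 2, 2, 3, 3, 4, 4, 4]
ys = [2.0, 2.4, 3.9, 4.3, 4.1, 5.8, 6.4, 9.0, 9.6, 9.3]

f_stat, df_lof, df_pure = lack_of_fit_f(xs, ys)
assert df_lof == 2, "closed-form quantile below needs 2 numerator d.f."
f_crit = f_quantile_d1_equals_2(0.95, df_pure)
reject = f_stat > f_crit

print("F =", f_stat, "with", (df_lof, df_pure), "degrees of freedom")
print("critical value at the 95% level:", f_crit)
print("reject the straight-line model:", reject)
```

For these made-up numbers the statistic exceeds the critical value, so the straight-line model would be rejected at the 95% level; with data that follow a line more closely, the statistic would fall below the critical value and the model would be retained.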

The assumptions of normal distribution of errors and independence can be shown to entail that this lack-of-fit test is the likelihood-ratio test of this null hypothesis.

Notes

1. Brook, Richard J.; Arnold, Gregory C. (1985). Applied Regression Analysis and Experimental Design. CRC Press. pp. 48–49. ISBN 0824772520.
2. Neter, John; Kutner, Michael H.; Nachtsheim, Christopher J.; Wasserman, William (1996). Applied Linear Statistical Models (Fourth ed.). Chicago: Irwin. pp. 121–122. ISBN 0256117365.

This page was last edited on 3 March 2023, at 09:50

From Wikipedia, the free encyclopedia