Bessel's correction

From Wikipedia, the free encyclopedia

In statistics, Bessel's correction is the use of n − 1 instead of n in the formula for the sample variance and sample standard deviation,[1] where n is the number of observations in a sample. This method corrects the bias in the estimation of the population variance. It also partially corrects the bias in the estimation of the population standard deviation. However, the correction often increases the mean squared error in these estimations. This technique is named after Friedrich Bessel.

YouTube Encyclopedic

  • Review and intuition why we divide by n-1 for the unbiased sample | Khan Academy
  • Bessel's Correction - Intro to Descriptive Statistics
  • Unbiased Estimators (Why n-1 ???) : Data Science Basics
  • Bessel's Correction - Intro to Descriptive Statistics
  • FINALLY! Why we divide by N-1 for Sample Variance and Standard Deviation

Transcription

What I want to do in this video is review much of what we've already talked about and then hopefully build some of the intuition for why we divide by n minus 1 when we want an unbiased estimate of the population variance from a sample. So let's think about a population. Let's say it is of size capital N, and we also have a sample of that population with lowercase n data points. Let's think about all of the parameters and statistics that we know about so far. The first is the idea of the mean. If we're trying to calculate the mean for the population, is that going to be a parameter or a statistic? Well, when we calculate it on the population, we are calculating a parameter, and when we attempt to calculate something for a sample, we call that a statistic. So how do we think about the mean for a population? First of all, we denote it with the Greek letter mu. We take the sum of every data point in our population, from the first data point all the way to the capital Nth data point, so x sub 1 plus x sub 2 all the way to x sub capital N, and then we divide by the total number of data points we have. How do we calculate the sample mean? We do a very similar thing with the sample, and we denote it with an x with a bar over it: we add up every data point in the sample, going up to lowercase n, and divide by the number of data points we actually had. Now, the other thing that we're trying to calculate for the population, which is a parameter, and which we'll also try to estimate from the sample, is the variance, which is a measure of how much the data points vary from the mean. So how do we denote and calculate the variance for a population? For the population, the variance, which we write with the Greek letter sigma squared, can be viewed as the mean of the squared distances from the population mean. For each data point, i equals 1 all the way to N, we take that data point, subtract the population mean from it, square it, and then divide by the total number of data points. To do this you'd want to figure out the population mean first; that's the most intuitive way, although there are other ways where you can calculate both at the same time. Now we get to the interesting part: sample variance. When people talk about sample variance, there are several ways to calculate it. One way is the biased sample variance, which is not an unbiased estimator of the population variance, and it is usually denoted by s with a subscript n.
And how do we calculate that biased estimator? Very similarly to how we calculated the variance right over here, but for our sample rather than our population: for every data point in our sample (we have n of them), we take that data point, subtract our sample mean from it, square it, and then divide by the number of data points that we have. But we already talked about this in the last video: what is our best unbiased estimate of the population variance? That is usually what we're trying to get at. In the last video we said that, if we want an unbiased estimate (and here, in this video, I want to give you a sense of the intuition why), we go through every data point in our sample, take that data point, subtract the sample mean from it, square that, and sum it up, but instead of dividing by n, we divide by n minus 1. We're dividing by a smaller number, and when you divide by a smaller number, you're going to get a larger value. So this one is going to be larger and this one is going to be smaller, and this one we refer to as the unbiased estimate and this one as the biased estimate. If people just say "sample variance", it's a good idea to clarify which one they're talking about, but if you had to guess and people give you no further information, they're probably talking about the unbiased estimate, so you'd probably divide by n minus 1. But let's think about why the divide-by-n estimate would be biased and why we might want to have an estimate that is larger. Then maybe in the future we could have a computer program or something that really makes us feel better that dividing by n minus 1 gives us a better estimate of the true population variance. So let's imagine all the data in a population, and I'm just going to plot them on a number line. So this is my number line, and let me plot all the data points in my population: here's some data, here's some data, and here is some more data, just points on the number line. So this is my entire population. Let's see how many points I have: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14. So in this case, my big N would be 14. Now, let's say I take a sample, a lowercase n; let's say my sample size is 3. Before I even think about that, let's think about roughly where the mean of this population would sit. The way I drew it (and I'm not going to calculate exactly), it looks like the true population mean, the parameter, might sit someplace roughly right over here. Now, let's think about what happens when we sample. I'm going to use a very small sample size just to give us the intuition, but this is true of any sample size. So let's say we have a sample size of 3. There is some possibility that, when we take our sample of size 3, we happen to sample it in a way that our sample mean is pretty close to our population mean.
For example, if we sampled that point, that point, and that point, I could imagine our sample mean might actually sit pretty close to our population mean. But there's a distinct possibility that, when I take a sample, I sample, say, that point and that point. And the key idea here is that when you take a sample, your sample mean is always going to sit within your sample, and so there is a possibility that when you take your sample, the true population mean could even be outside of the sample. In that situation (and this is just to give you an intuition), your sample mean is going to be sitting someplace in there, and so if you were to calculate the distance from each of these points to the sample mean (so this distance and that distance), square it, and divide by the number of data points you have, this is going to be a much lower estimate than the true variance from the actual population mean, from which these points are much, much further. Now, the true population mean is not always going to be outside of your sample, but it's possible that it is. So in general, when you just take your points and find the squared distances to your sample mean, which is always going to sit inside of your data even though the true population mean could be outside of it (or at one end of your data, however you might want to think about it), you are likely to be underestimating the true population variance. So this right over here is an underestimate. And it does turn out that if, instead of dividing by n, you divide by n minus 1, you'll get a slightly larger sample variance, and this is an unbiased estimate. In the next video (and I might not get to it immediately), I would like to generate some type of computer program that is more convincing that this is a better estimate of the population variance than that one is.
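The "computer program" the video alludes to can be sketched as a short simulation. The following is an illustrative script only (it is not from the video, and it borrows the small population used later in this article); it compares the average of the divide-by-n and divide-by-(n − 1) estimates against the known population variance.

```python
import random

# Known population (the same one used in the "Source of bias" section below).
population = [0, 0, 0, 1, 2, 9]
N = len(population)
pop_mean = sum(population) / N
pop_var = sum((x - pop_mean) ** 2 for x in population) / N   # true variance, divide by N

n = 3             # sample size
trials = 200_000
biased_sum = 0.0
unbiased_sum = 0.0

for _ in range(trials):
    sample = [random.choice(population) for _ in range(n)]   # sampling with replacement
    m = sum(sample) / n
    ss = sum((x - m) ** 2 for x in sample)
    biased_sum += ss / n            # divide by n
    unbiased_sum += ss / (n - 1)    # divide by n - 1 (Bessel's correction)

print("true population variance:          ", pop_var)
print("average divide-by-n estimate:      ", biased_sum / trials)
print("average divide-by-(n - 1) estimate:", unbiased_sum / trials)
```

Run long enough, the divide-by-(n − 1) average settles near the true variance, while the divide-by-n average settles near (n − 1)/n of it, matching the intuition above.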

Formulation

In estimating the population variance from a sample when the population mean is unknown, the uncorrected sample variance is the mean of the squares of the deviations of sample values from the sample mean (i.e., using a multiplicative factor 1/n). In this case, the sample variance is a biased estimator of the population variance. Multiplying the uncorrected sample variance by the factor

$$\frac{n}{n-1}$$

gives an unbiased estimator of the population variance. In some literature,[2][3] the above factor is called Bessel's correction.

One can understand Bessel's correction as the degrees of freedom in the residuals vector (residuals, not errors, because the population mean is unknown):

$$(x_1 - \bar{x},\; \ldots,\; x_n - \bar{x}),$$

where x̄ is the sample mean. While there are n independent observations in the sample, there are only n − 1 independent residuals, as they sum to 0. For a more intuitive explanation of the need for Bessel's correction, see § Source of bias.
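As a quick illustration of the degrees-of-freedom point (a sketch using numpy and made-up data, not part of the article): the residuals sum to zero, so only n − 1 of them are free, and rescaling the uncorrected variance by n/(n − 1) is exactly what numpy's ddof=1 option does.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])   # made-up data
n = len(x)

residuals = x - x.mean()
print(residuals.sum())        # 0 (up to floating point): only n - 1 residuals are free

var_uncorrected = np.var(x, ddof=0)   # divide by n
var_corrected = np.var(x, ddof=1)     # divide by n - 1 (Bessel's correction)
print(var_corrected, var_uncorrected * n / (n - 1))   # the two values agree
```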

Generally, Bessel's correction is an approach to reduce the bias due to finite sample size. Such finite-sample bias correction is also needed for other estimates, such as skewness and kurtosis, but in these the inaccuracies are often significantly larger. To fully remove such bias it is necessary to do a more complex multi-parameter estimation. For instance, a correct correction for the standard deviation depends on the kurtosis (the normalized central 4th moment), but this again has a finite-sample bias and depends on the standard deviation, i.e. both estimates have to be merged.

Caveats

There are three caveats to consider regarding Bessel's correction:

  1. It does not yield an unbiased estimator of standard deviation.
  2. The corrected estimator often has a higher mean squared error (MSE) than the uncorrected estimator.[4] Furthermore, there is no population distribution for which it has the minimum MSE because a different scale factor can always be chosen to minimize MSE.
  3. It is only necessary when the population mean is unknown (and is estimated as the sample mean). In practice, this is generally the case.

Firstly, while the sample variance (using Bessel's correction) is an unbiased estimator of the population variance, its square root, the sample standard deviation, is a biased estimate of the population standard deviation; because the square root is a concave function, the bias is downward, by Jensen's inequality. There is no general formula for an unbiased estimator of the population standard deviation, though there are correction factors for particular distributions, such as the normal; see unbiased estimation of standard deviation for details. An approximation for the exact correction factor for the normal distribution is given by using n − 1.5 in the formula: the bias decays quadratically (rather than linearly, as in the uncorrected form and Bessel's corrected form).
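A numerical illustration of this first caveat (a sketch with arbitrary parameters): for normal data, the square root of the Bessel-corrected variance still underestimates the population standard deviation, while the n − 1.5 denominator mentioned above nearly removes the bias.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0                      # true population standard deviation
n, trials = 10, 1_000_000

x = rng.normal(0.0, sigma, size=(trials, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # sum of squared residuals

print(np.mean(np.sqrt(ss / (n - 1))))    # noticeably below sigma = 2.0 (biased low)
print(np.mean(np.sqrt(ss / (n - 1.5))))  # much closer to sigma = 2.0
```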

Secondly, the unbiased estimator does not minimize mean squared error (MSE), and generally has worse MSE than the uncorrected estimator (this varies with excess kurtosis). MSE can be minimized by using a different factor. The optimal value depends on excess kurtosis, as discussed in mean squared error: variance; for the normal distribution this is optimized by dividing by n + 1 (instead of n − 1 or n).
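This can be checked empirically for normal data. The sketch below (illustrative only; parameters are arbitrary) estimates the mean squared error of the estimators that divide the sum of squared deviations by n − 1, n, and n + 1; the n + 1 divisor should come out smallest.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0                     # true population variance
n, trials = 10, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ss = ((samples - samples.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for divisor in (n - 1, n, n + 1):
    estimates = ss / divisor
    mse = np.mean((estimates - sigma2) ** 2)
    print(f"divide by {divisor}: MSE = {mse:.4f}")   # n + 1 gives the smallest MSE
```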

Thirdly, Bessel's correction is only necessary when the population mean is unknown, and one is estimating both population mean and population variance from a given sample, using the sample mean to estimate the population mean. In that case there are n degrees of freedom in a sample of n points, and simultaneous estimation of mean and variance means one degree of freedom goes to the sample mean and the remaining n − 1 degrees of freedom (the residuals) go to the sample variance. However, if the population mean is known, then the deviations of the observations from the population mean have n degrees of freedom (because the mean is not being estimated – the deviations are not residuals but errors) and Bessel's correction is not applicable.
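The third caveat can also be demonstrated numerically. In the sketch below (illustrative, with arbitrary parameters), the population mean is treated as known: dividing the squared errors by n is then already unbiased, while dividing the squared residuals by n is not.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 10.0, 9.0           # known population mean and true variance
n, trials = 5, 300_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))

# Deviations from the known population mean (errors): divide by n is unbiased.
known_mean_est = ((x - mu) ** 2).mean(axis=1)
# Deviations from the sample mean (residuals): divide by n is biased.
sample_mean_est = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) / n

print(np.mean(known_mean_est))    # ~ 9.0: no correction needed
print(np.mean(sample_mean_est))   # ~ (n - 1)/n * 9.0 = 7.2: needs Bessel's correction
```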

Source of bias

Most simply, to understand the bias that needs correcting, think of an extreme case. Suppose the population is (0, 0, 0, 1, 2, 9), which has a population mean of 2 and a population variance of 31/3 ≈ 10.33. A sample of n = 1 is drawn; say it consists of the single observation x₁. The best estimate of the population mean is then the sample mean x̄ = x₁. But what if we use the uncorrected formula to estimate the variance? The estimate of the variance would be zero, and the estimate would be zero for any population and any sample of n = 1. The problem is that in estimating the sample mean, the process has already made our estimate of the mean close to the value we sampled, in fact identical for n = 1. In the case of n = 1, the variance just cannot be estimated, because there is no variability in the sample.

But consider n = 2. Suppose the sample were (0, 2). Then x̄ = 1 and the uncorrected estimate is ((0 − 1)² + (2 − 1)²)/2 = 1, but with Bessel's correction, ((0 − 1)² + (2 − 1)²)/(2 − 1) = 2, which is an unbiased estimate (if all possible samples of n = 2 drawn without replacement are taken and this method is used, the average estimate will be 12.4, the same as the result of applying the Bessel-corrected formula to the whole population; see the sketch below).
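The 12.4 figure can be verified by brute force. The short sketch below assumes the two observations are drawn without replacement, enumerates every possible sample of n = 2 from the population (0, 0, 0, 1, 2, 9), and averages the Bessel-corrected estimates.

```python
from itertools import combinations

population = [0, 0, 0, 1, 2, 9]

estimates = []
for a, b in combinations(population, 2):      # all n = 2 samples, without replacement
    mean = (a + b) / 2
    corrected = ((a - mean) ** 2 + (b - mean) ** 2) / (2 - 1)   # divide by n - 1
    estimates.append(corrected)

print(sum(estimates) / len(estimates))        # 12.4

# 12.4 is also what the Bessel-corrected formula gives on the whole population:
pop_mean = sum(population) / len(population)
print(sum((x - pop_mean) ** 2 for x in population) / (len(population) - 1))   # 12.4
```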

To see this in more detail, consider the following example. Suppose the mean of the whole population is 2050, but the statistician does not know that, and must estimate it based on this small sample chosen randomly from the population:

2051, 2053, 2055, 2050, 2051

One may compute the sample average:

(2051 + 2053 + 2055 + 2050 + 2051) / 5 = 2052

This may serve as an observable estimate of the unobservable population average, which is 2050. Now we face the problem of estimating the population variance. That is the average of the squares of the deviations from 2050. If we knew that the population average is 2050, we could proceed as follows:

[(2051 − 2050)² + (2053 − 2050)² + (2055 − 2050)² + (2050 − 2050)² + (2051 − 2050)²] / 5 = (1 + 9 + 25 + 0 + 1) / 5 = 36 / 5 = 7.2

But our estimate of the population average is the sample average, 2052. The actual average, 2050, is unknown. So the sample average, 2052, must be used:

[(2051 − 2052)² + (2053 − 2052)² + (2055 − 2052)² + (2050 − 2052)² + (2051 − 2052)²] / 5 = (1 + 1 + 9 + 4 + 1) / 5 = 16 / 5 = 3.2

The variance is now smaller, and it (almost) always is. The only exception occurs when the sample average and the population average are the same. To understand why, consider that variance measures distance from a point, and within a given sample, the average is precisely that point which minimises the distances. A variance calculation using any other average value must produce a larger result.

To see this algebraically, we use a simple identity:

(a + b)² = a² + 2ab + b²

with a representing the deviation of an individual sample from the sample mean, and b representing the deviation of the sample mean from the population mean. Note that we have simply decomposed the actual deviation of an individual sample from the (unknown) population mean into two components: the deviation of the single sample from the sample mean, which we can compute, and the additional deviation of the sample mean from the population mean, which we cannot. Now we apply this identity to the squares of deviations from the population mean. For the first observation:

(2051 − 2050)² = [(2051 − 2052) + (2052 − 2050)]² = (2051 − 2052)² + 2(2051 − 2052)(2052 − 2050) + (2052 − 2050)²

Now apply this to all five observations and observe certain patterns (here a = observation − 2052 and b = 2052 − 2050 = 2):

  Observation |  a² |  2ab |  b²
  2051        |   1 |   −4 |   4
  2053        |   1 |    4 |   4
  2055        |   9 |   12 |   4
  2050        |   4 |   −8 |   4
  2051        |   1 |   −4 |   4
  Sum         |  16 |    0 |  20

The sum of the entries in the 2ab column must be zero, because the terms a (the deviations of the individual observations from the sample mean) sum to zero: the five observations, added together, give the same total as five times their sample mean (5 × 2052), so the difference of the two sums is zero. The factor 2 and the term b are the same in every row, so they do not change this. The following statements explain the meaning of the remaining columns:

  • The sum of the entries in the a² column is the sum of the squared distances from the samples to the sample mean.
  • The sum of the entries in the b² column is the sum of the squared distances between the measured sample mean and the correct population mean.
  • Every single row now consists of a pair of a² (biased, because the sample mean is used) and b² (the correction of the bias, because it takes the difference between the "real" population mean and the inaccurate sample mean into account). Therefore the sum of all entries of the a² and b² columns represents the correct quantity, namely the sum of squared distances between the samples and the population mean.
  • The sum of the a² column plus the b² column must be bigger than the sum of the a² column alone, since all the entries in the b² column are positive (except when the population mean is the same as the sample mean, in which case all of the numbers in the b² column will be 0).

Therefore:

  • The sum of squares of the distance from samples to the population mean will always be bigger than the sum of squares of the distance to the sample mean, except when the sample mean happens to be the same as the population mean, in which case the two are equal.

That is why the sum of squares of the deviations from the sample mean is too small to give an unbiased estimate of the population variance when the average of those squares is found. The smaller the sample size, the larger the difference between the sample variance and the population variance.
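The arithmetic in this section is easy to check mechanically. The sketch below uses the sample values from the example above and reproduces the 7.2 and 3.2 figures as well as the column sums of the decomposition: the squared deviations from the sample mean (16) plus the b² terms (20) recover the squared deviations from the population mean (36).

```python
sample = [2051, 2053, 2055, 2050, 2051]
pop_mean = 2050
n = len(sample)
sample_mean = sum(sample) / n                              # 2052

dev_pop = sum((x - pop_mean) ** 2 for x in sample)         # 36: squared deviations from 2050
dev_sample = sum((x - sample_mean) ** 2 for x in sample)   # 16: the a² column
b2 = n * (sample_mean - pop_mean) ** 2                     # 20: the b² column

print(dev_pop / n, dev_sample / n)      # 7.2 and 3.2
print(dev_sample + b2 == dev_pop)       # True: 16 + 20 == 36
```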

Terminology

This correction is so common that the terms "sample variance" and "sample standard deviation" are frequently used to mean the corrected estimators (the unbiased sample variance and the less-biased sample standard deviation), using n − 1. However, caution is needed: some calculators and software packages may provide both or only the more unusual formulation. This article uses the following symbols and definitions:

  • μ is the population mean
  • x̄ is the sample mean
  • σ² is the population variance
  • sn² is the biased sample variance (i.e. without Bessel's correction)
  • s² is the unbiased sample variance (i.e. with Bessel's correction)

The standard deviations will then be the square roots of the respective variances. Since the square root introduces bias, the terminology "uncorrected" and "corrected" is preferred for the standard deviation estimators:

  • sn is the uncorrected sample standard deviation (i.e. without Bessel's correction)
  • s is the corrected sample standard deviation (i.e. with Bessel's correction), which is less biased, but still biased

Formula

The sample mean is given by

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The biased sample variance is then written:

$$s_n^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

and the unbiased sample variance is written:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{n}{n-1}\, s_n^2.$$
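As a sanity check (an illustrative sketch, not part of the article), the formulas above can be written directly in Python; the results agree with the standard library's statistics.pvariance (divisor n) and statistics.variance (divisor n − 1).

```python
import statistics

x = [2051, 2053, 2055, 2050, 2051]        # the sample from the "Source of bias" section
n = len(x)
x_bar = sum(x) / n

s_n2 = sum((xi - x_bar) ** 2 for xi in x) / n         # biased sample variance
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)     # unbiased sample variance
assert abs(s2 - s_n2 * n / (n - 1)) < 1e-12           # s² = n/(n-1) · sn²

print(s_n2, statistics.pvariance(x))   # both 3.2
print(s2, statistics.variance(x))      # both 4.0
```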

Proof

Suppose, then, that x₁, …, xₙ are independent and identically distributed random variables with expectation μ and variance σ².

Knowing the values of the xᵢ at an outcome of the underlying sample space, we would like to get a good estimate for the variance σ², which is unknown. To this end, we construct a mathematical formula containing the xᵢ such that the expectation of this formula is precisely σ². This means that on average, this formula should produce the right answer.

The educated, but naive, way of guessing such a formula would be

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

where x̄ = (x₁ + ⋯ + xₙ)/n; this would be the variance if we had a discrete random variable on the discrete probability space {1, …, n} (with uniform probability 1/n) that had value xᵢ at i. But let us calculate the expected value of this expression:

$$\operatorname{E}\left[\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{1}{n} \sum_{i=1}^{n} \left(\operatorname{E}[x_i^2] - 2\operatorname{E}[x_i \bar{x}] + \operatorname{E}[\bar{x}^2]\right);$$

here we have (by independence, symmetric cancellation and equal distribution)

$$\operatorname{E}[x_i \bar{x}] = \operatorname{E}[\bar{x}^2] = \mu^2 + \frac{\sigma^2}{n}, \qquad \operatorname{E}[x_i^2] = \mu^2 + \sigma^2,$$

and therefore

$$\operatorname{E}\left[\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{1}{n} \sum_{i=1}^{n} \left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2.$$

In contrast,

$$\operatorname{E}\left[\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right] = \frac{n}{n-1} \cdot \frac{n-1}{n}\,\sigma^2 = \sigma^2.$$

Therefore, our initial guess was wrong by a factor of

$$\frac{n}{n-1},$$

and this is precisely Bessel's correction.
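A quick Monte Carlo check of the two expectations derived above (a sketch that assumes i.i.d. normal data, though any distribution with finite variance would do):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma2 = 3.0, 5.0
n, trials = 7, 300_000

x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print(np.mean(ss / n))          # ~ (n - 1)/n · sigma2 = 30/7 ≈ 4.29
print(np.mean(ss / (n - 1)))    # ~ sigma2 = 5.0
```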

Notes

  1. ^ Radziwill, Nicole M (2017). Statistics (the easier way) with R. Lapis Lucera. ISBN 9780996916059. OCLC 1030532622.
  2. ^ Reichmann, W. J. (1961) Use and Abuse of Statistics, Methuen. Reprinted 1964–1970 by Pelican. Appendix 8.
  3. ^ Upton, G.; Cook, I. (2008) Oxford Dictionary of Statistics, OUP. ISBN 978-0-19-954145-4 (entry for "Variance (data)")
  4. ^ Rosenthal, Jeffrey S. (2015). "The Kids are Alright: Divide by n when estimating variance". Bulletin of the Institute of Mathematical Statistics. December 2015: 9.
