Index of dispersion

In probability theory and statistics, the index of dispersion,^[1] dispersion index, coefficient of dispersion, relative variance, or variance-to-mean ratio (VMR), like the coefficient of variation, is a normalized measure of the dispersion of a probability distribution: it is used to quantify whether a set of observed occurrences is clustered or dispersed compared to a standard statistical model.

It is defined as the ratio of the variance $\sigma^{2}$ to the mean $\mu$:

$$D = \frac{\sigma^{2}}{\mu}.$$

It is also known as the Fano factor, though this term is sometimes reserved for windowed data (the mean and variance are computed over a subpopulation), with the index of dispersion corresponding to the special case where the window is infinite. Windowing is common: the VMR is often computed over various intervals in time or small regions in space, which may be called "windows", and the resulting statistic is then called the Fano factor.

It is only defined when the mean $\mu$ is non-zero, and is generally only used for positive statistics, such as count data or time between events, or where the underlying distribution is assumed to be the exponential distribution or Poisson distribution.
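As a minimal sketch of the definition above, the following Python snippet computes the ratio of the population variance to the mean for a list of values. The function name `index_of_dispersion` and the sample data are illustrative assumptions, not part of any standard library.

```python
from statistics import fmean, pvariance


def index_of_dispersion(values):
    """Return population variance divided by the mean.

    Raises ValueError when the mean is zero, where the ratio is undefined.
    """
    mu = fmean(values)
    if mu == 0:
        raise ValueError("index of dispersion is undefined for a zero mean")
    return pvariance(values, mu) / mu


# Hypothetical count data, used only for illustration.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
print(index_of_dispersion(counts))
```

The population variance is used here to match the definition in terms of $\sigma^{2}$; an estimate from a sample would typically use the sample variance instead.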

Terminology

In this context, the observed dataset may consist of the times of occurrence of predefined events, such as earthquakes in a given region over a given magnitude, or of the locations in geographical space of plants of a given species. Details of such occurrences are first converted into counts of the numbers of events or occurrences in each of a set of equal-sized time- or space-regions.
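The conversion from event times to counts can be sketched as follows, assuming the observation period is split into windows of equal width. The function name `counts_per_window` and the event times are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pvariance


def counts_per_window(event_times, start, stop, n_windows):
    """Count events falling in each of n_windows equal-width windows of [start, stop)."""
    width = (stop - start) / n_windows
    counts = [0] * n_windows
    for t in event_times:
        if start <= t < stop:
            # min() guards against floating-point rounding at the upper edge.
            index = min(int((t - start) / width), n_windows - 1)
            counts[index] += 1
    return counts


# Hypothetical event times (e.g. in days) over a 10-day observation period.
times = [0.5, 1.2, 1.9, 3.1, 3.3, 3.8, 7.4, 9.9]
counts = counts_per_window(times, start=0.0, stop=10.0, n_windows=5)
mean_count = fmean(counts)
print(counts, pvariance(counts, mean_count) / mean_count)
```

The same idea applies to spatial data by replacing time windows with equal-sized regions.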

The above defines a dispersion index for counts.^[2] A different definition applies for a dispersion index for intervals,^[3] where the quantities treated are the lengths of the time-intervals between the events. Common usage is that "index of dispersion" means the dispersion index for counts.

Interpretation

Some distributions, most notably the Poisson distribution, have equal variance and mean, giving them a VMR = 1. The geometric distribution and the negative binomial distribution have VMR > 1, while the binomial distribution has VMR < 1, and the constant random variable has VMR = 0. This yields the following table:

Distribution	VMR	Dispersion
constant random variable	VMR = 0	not dispersed
binomial distribution	0 < VMR < 1	under-dispersed
Poisson distribution	VMR = 1	
negative binomial distribution	VMR > 1	over-dispersed

This can be considered analogous to the classification of conic sections by eccentricity; see Cumulants of particular probability distributions for details.
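The classification can be checked from the textbook mean and variance of each distribution. The sketch below uses exact rational arithmetic and assumes a particular parameterization of the negative binomial distribution (number of failures before the r-th success, with success probability p); the function names are illustrative.

```python
from fractions import Fraction


def vmr(mean, variance):
    """Variance-to-mean ratio as an exact fraction."""
    return Fraction(variance) / Fraction(mean)


def binomial_vmr(n, p):
    # Binomial(n, p): mean n*p, variance n*p*(1 - p), so VMR = 1 - p < 1.
    return vmr(n * p, n * p * (1 - p))


def poisson_vmr(lam):
    # Poisson(lam): mean and variance are both lam, so VMR = 1.
    return vmr(lam, lam)


def negative_binomial_vmr(r, p):
    # Assumed parameterization: failures before the r-th success.
    # Mean r*(1 - p)/p, variance r*(1 - p)/p**2, so VMR = 1/p > 1.
    mean = r * (1 - p) / p
    return vmr(mean, mean / p)


p = Fraction(1, 4)
print(binomial_vmr(10, p))           # 3/4
print(poisson_vmr(3))                # 1
print(negative_binomial_vmr(3, p))   # 4
```

Under this parameterization the geometric distribution is the case r = 1, which gives the same VMR of 1/p.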

The relevance of the index of dispersion is that it has a value of 1 when the probability distribution of the number of occurrences in an interval is a Poisson distribution. Thus the measure can be used to assess whether observed data can be modeled using a Poisson process. When the coefficient of dispersion is less than 1, a dataset is said to be "under-dispersed": this condition can relate to patterns of occurrence that are more regular than the randomness associated with a Poisson process. For instance, regular, periodic events will be under-dispersed. If the index of dispersion is larger than 1, a dataset is said to be over-dispersed.

A sample-based estimate of the dispersion index can be used to construct a formal statistical hypothesis test for the adequacy of the model that a series of counts follow a Poisson distribution.^[4]^[5] In terms of the interval-counts, over-dispersion corresponds to there being more intervals with low counts and more intervals with high counts, compared to a Poisson distribution: in contrast, under-dispersion is characterised by there being more intervals having counts close to the mean count, compared to a Poisson distribution.
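One common construction of such a test, stated here as an assumption rather than as the exact procedure in the cited sources, multiplies the sample VMR by n - 1, which equals the sum of squared deviations divided by the mean. Under the Poisson hypothesis this statistic is compared with a chi-squared distribution with n - 1 degrees of freedom. The sketch computes only the statistic and the degrees of freedom; the function name and data are illustrative.

```python
from statistics import fmean


def dispersion_statistic(counts):
    """Return (statistic, degrees_of_freedom) for a Poisson dispersion test.

    statistic = sum((x - mean)**2) / mean, i.e. (n - 1) times the sample VMR.
    """
    n = len(counts)
    mean = fmean(counts)
    if mean == 0:
        raise ValueError("statistic is undefined for a zero mean")
    statistic = sum((x - mean) ** 2 for x in counts) / mean
    return statistic, n - 1


# Hypothetical interval counts, used only for illustration.
counts = [2, 4, 4, 4, 5, 5, 7, 9]
statistic, dof = dispersion_statistic(counts)
print(statistic, dof)
```

A p-value would then be read from a chi-squared table or obtained from a statistics library; large values of the statistic point towards over-dispersion and small values towards under-dispersion.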

The VMR is also a good measure of the degree of randomness of a given phenomenon. For example, this technique is commonly used in currency management.

Example

For randomly diffusing particles (Brownian motion), the distribution of the number of particles inside a given volume is Poissonian, i.e. VMR = 1. Therefore, to assess whether a given spatial pattern (assuming there is a way to measure it) is due purely to diffusion or whether some particle-particle interaction is involved, divide the space into patches, quadrats or sample units (SU), count the number of individuals in each patch or SU, and compute the VMR. VMRs significantly higher than 1 denote a clustered distribution, where the random walk is not enough to smother the attractive inter-particle potential.
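The quadrat procedure can be sketched as follows, assuming points lie in the unit square and the patches form a k × k grid of equal cells. The simulated points are drawn uniformly at random as a stand-in for non-interacting particles; all names and parameters are illustrative.

```python
import random
from statistics import fmean, pvariance


def quadrat_counts(points, k):
    """Count points of the unit square in each cell of a k-by-k grid."""
    counts = [0] * (k * k)
    for x, y in points:
        # min() keeps points lying exactly on the upper edge inside the grid.
        ix = min(int(x * k), k - 1)
        iy = min(int(y * k), k - 1)
        counts[iy * k + ix] += 1
    return counts


def vmr(counts):
    """Variance-to-mean ratio of the quadrat counts."""
    mu = fmean(counts)
    return pvariance(counts, mu) / mu


# Uniformly scattered points: the VMR should come out close to 1.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(2000)]
print(vmr(quadrat_counts(points, 10)))
```

The result depends on the chosen patch size, so in practice the VMR is often examined for several grid resolutions.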

History

The first to discuss the use of a test to detect deviations from a Poisson or binomial distribution appears to have been Lexis in 1877. One of the tests he developed was the Lexis ratio.

This index was first used in botany by Clapham in 1936.

Hoel studied the first four moments of its distribution.^[6] He found that the approximation to the χ² statistic is reasonable if μ > 5.

Skewed distributions

For highly skewed distributions, it may be more appropriate to use a linear loss function, as opposed to a quadratic one. The analogous coefficient of dispersion in this case is the ratio of the average absolute deviation from the median to the median of the data,^[7] or, in symbols:

$$CD = \frac{1}{n} \frac{\sum_{j} |m - x_{j}|}{m}$$

where n is the sample size, m is the sample median, and the sum is taken over the whole sample. Iowa, New York and South Dakota use this linear coefficient of dispersion to estimate taxes due.^[8]^[9]^[10]
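A minimal sketch of this linear coefficient of dispersion is given below. The function name `linear_cd` and the sample values (imagined as assessment ratios) are assumptions made for illustration.

```python
from statistics import median


def linear_cd(values):
    """Average absolute deviation from the median, divided by the median."""
    m = median(values)
    if m == 0:
        raise ValueError("coefficient of dispersion is undefined for a zero median")
    return sum(abs(m - x) for x in values) / (len(values) * m)


# Hypothetical assessment ratios, used only for illustration.
ratios = [0.8, 0.9, 1.0, 1.1, 1.4]
print(linear_cd(ratios))
```

Because it is built on the median and absolute deviations, this measure is less sensitive to a few extreme values than the variance-to-mean ratio.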

For a two-sample test in which the sample sizes are large, both samples have the same median, and differ in the dispersion around it, a confidence interval for the linear coefficient of dispersion is bounded inferiorly by

$$\frac{t_{a}}{t_{b}} \exp\left(-\sqrt{z_{\alpha} \left(\operatorname{var}\left[\log\left(\frac{t_{a}}{t_{b}}\right)\right]\right)}\right)$$

where $t_j$ is the mean absolute deviation of the $j$th sample and $z_\alpha$ is the standard normal critical value for confidence level α (e.g., for α = 0.05, $z_\alpha$ = 1.96).^[7]

Notes

^ Cox & Lewis (1966)
^ Cox & Lewis (1966), p. 72
^ Cox & Lewis (1966), p. 71
^ Cox & Lewis (1966), p. 158
^ Upton & Cook (2006), under index of dispersion
^ Hoel, P. G. (1943). "On Indices of Dispersion". Annals of Mathematical Statistics. 14 (2): 155–162. doi:10.1214/aoms/1177731457. JSTOR 2235818.
^ ^a ^b Bonett, DG; Seier, E (2006). "Confidence interval for a coefficient of dispersion in non-normal distributions". Biometrical Journal. 48 (1): 144–148. doi:10.1002/bimj.200410148. PMID 16544819. S2CID 33665632.
^ "Statistical Calculation Definitions for Mass Appraisal" (PDF). Iowa.gov. Archived from the original (PDF) on 11 November 2010. Median Ratio: The ratio located midway between the highest ratio and the lowest ratio when individual ratios for a class of realty are ranked in ascending or descending order. The median ratio is most frequently used to determine the level of assessment for a given class of real estate.
^ "Assessment equity in New York: Results from the 2010 market value survey". Archived from the original on 6 November 2012.
^ "Summary of the Assessment Process" (PDF). state.sd.us. South Dakota Department of Revenue - Property/Special Taxes Division. Archived from the original (PDF) on 10 May 2009.

References

Cox, D. R.; Lewis, P. A. W. (1966). The Statistical Analysis of Series of Events. London: Methuen.
Upton, G.; Cook, I. (2006). Oxford Dictionary of Statistics (2nd ed.). Oxford University Press. ISBN 978-0-19-954145-4.
