Limiting density of discrete points

In information theory, the limiting density of discrete points is an adjustment to Claude Shannon's formula for differential entropy.

It was formulated by Edwin Thompson Jaynes to address defects in the initial definition of differential entropy.

Definition

Shannon originally wrote down the following formula for the entropy of a continuous distribution, known as differential entropy:

$$h(X)=-\int p(x)\log p(x)\,dx.$$

Unlike Shannon's formula for the discrete entropy, however, this is not the result of any derivation (Shannon simply replaced the summation symbol in the discrete version with an integral), and it lacks many of the properties that make the discrete entropy a useful measure of uncertainty. In particular, it is not invariant under a change of variables and can become negative. In addition, it is not even dimensionally correct: because $p(x)\,dx$ is a dimensionless probability, $p(x)$ must have units of $1/x$, which means that the argument to the logarithm is not dimensionless as required.
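The two defects named above (negative values and dependence on the choice of variable) can be seen in a minimal numerical sketch. The uniform density, the metre/centimetre rescaling, and the helper names `differential_entropy` and `uniform_pdf` are assumptions made for this illustration only.

```python
import math

def differential_entropy(pdf, lo, hi, steps=100_000):
    """Midpoint-rule estimate of -∫ p(x) log p(x) dx over [lo, hi], in nats."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        p = pdf(lo + (i + 0.5) * dx)
        if p > 0.0:
            total -= p * math.log(p) * dx
    return total

def uniform_pdf(width):
    """Density of a uniform distribution on [0, width]."""
    return lambda x: 1.0 / width if 0.0 <= x <= width else 0.0

# The same quantity expressed in two units: X in metres, Y = 100 X in centimetres.
h_metres = differential_entropy(uniform_pdf(0.5), 0.0, 0.5)
h_centimetres = differential_entropy(uniform_pdf(50.0), 0.0, 50.0)

print(h_metres)                   # negative, close to log(0.5)
print(h_centimetres)              # positive, close to log(50)
print(h_centimetres - h_metres)   # close to log(100): the value depends on the units
```

Rescaling the variable by a factor of 100 shifts the result by $\log(100)$, so the number produced by Shannon's formula is a property of the chosen coordinate as much as of the distribution.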

Jaynes argued that the formula for the continuous entropy should be derived by taking the limit of increasingly dense discrete distributions.^[1]^[2] Suppose that we have a set of $N$ discrete points $\{x_{i}\}$, such that in the limit $N\to \infty$ their density approaches a function $m(x)$ called the "invariant measure":

$$\lim _{N\to \infty }{\frac {1}{N}}\,(\text{number of points in }a<x<b)=\int _{a}^{b}m(x)\,dx.$$
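As a rough illustration of this limit, the sketch below places $N$ points at the quantiles of a hypothetical invariant measure $m(x)=2x$ on $[0,1]$ and checks that the fraction of points inside an interval approaches the integral of $m$ over it. The choice of $m$, the interval $(0.2, 0.5)$, and the function names are assumptions made for the example, not part of Jaynes' presentation.

```python
import math

def quantile_points(n):
    """Return n points whose density follows m(x) = 2x on [0, 1].

    The cumulative measure is M(x) = x**2, so the i-th point sits where
    M(x) = (i + 0.5) / n, i.e. x = sqrt((i + 0.5) / n).
    """
    return [math.sqrt((i + 0.5) / n) for i in range(n)]

def fraction_in(points, a, b):
    """Fraction of the points that fall strictly inside (a, b)."""
    return sum(1 for x in points if a < x < b) / len(points)

a, b = 0.2, 0.5
exact = b**2 - a**2  # integral of 2x from a to b

for n in (10, 100, 10_000):
    print(n, fraction_in(quantile_points(n), a, b), exact)
```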

Jaynes derived from this the following formula for the continuous entropy, which he argued should be taken as the correct formula:

$$\lim _{N\rightarrow \infty }H_{N}(X)=\log(N)-\int p(x)\log {\frac {p(x)}{m(x)}}\,dx.$$

In practice the $\log(N)$ term is usually omitted, since it diverges as $N\to \infty$. The common definition is therefore

$$H(X)=-\int p(x)\log {\frac {p(x)}{m(x)}}\,dx.$$

Where it is unclear whether or not the $\log(N)$ term should be omitted, one could write

$$H_{N}(X)\sim \log(N)+H(X).$$

Notice that in Jaynes' formula, $m(x)$ is a probability density. For any finite $N$, $m(x)$ is a uniform density over the quantization of the continuous space that is used in the Riemann sum. In the limit, $m(x)$ is the continuous limiting density of points in the quantization used to represent the continuous variable $x$.

Suppose one had a number format that took on $N$ possible values, distributed as per $m(x)$. Then $H_{N}(X)$ (if $N$ is large enough that the continuous approximation is valid) is the discrete entropy of the variable $x$ in this encoding. With base-2 logarithms, this is equal to the average number of bits required to transmit this information, and is no more than $\log(N)$. Therefore, $-H(X)$ may be thought of as the amount of information gained by knowing that the variable $x$ follows the distribution $p(x)$, and is not uniformly distributed over the possible quantized values, as would be the case if it followed $m(x)$. Indeed, $H(X)$ is the negative of the Kullback–Leibler divergence from $m(x)$ to $p(x)$, that is, $H(X)=-D_{\text{KL}}(p\parallel m)$, and this divergence is thought of as the information gained by learning that a variable previously thought to be distributed as $m(x)$ is actually distributed as $p(x)$.
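The "number format" picture can be tried numerically. The sketch below assumes hypothetical densities $m(x)=2x$ and $p(x)=3x^{2}$ on $[0,1]$, splits the interval into $N$ cells that each hold $1/N$ of the mass of $m$, and compares the discrete entropy of $p$ over those cells with $\log(N)+H(X)$. For these densities the integral evaluates in closed form to $H(X)=1/3-\log(1.5)$. The densities and the function name are illustrative choices, not taken from the source.

```python
import math

def quantized_entropy(n):
    """Discrete entropy (nats) of p(x) = 3x**2 over n cells of equal m-mass.

    With m(x) = 2x the cumulative measure is M(x) = x**2, so the cell edges
    are sqrt(k / n). The probability of a cell under p is F(hi) - F(lo),
    where F(x) = x**3 is the cumulative distribution of p.
    """
    total = 0.0
    for k in range(n):
        lo = math.sqrt(k / n)
        hi = math.sqrt((k + 1) / n)
        prob = hi**3 - lo**3
        total -= prob * math.log(prob)
    return total

# Closed form of -∫ p log(p/m) dx for the densities assumed above.
H_continuous = 1.0 / 3.0 - math.log(1.5)

for n in (10, 100, 1000, 10_000):
    # The middle column should approach H_continuous as n grows.
    print(n, quantized_entropy(n) - math.log(n), H_continuous)
```

The discrete entropy itself grows without bound like $\log(N)$. Only the difference $H_{N}(X)-\log(N)$ settles to a finite, non-positive value.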

Jaynes' continuous entropy formula has the property of being invariant under a change of variables, provided that $m(x)$ and $p(x)$ are transformed in the same way. (This motivates the name "invariant measure" for $m$.) This solves many of the difficulties that come from applying Shannon's continuous entropy formula. Jaynes himself dropped the $\log(N)$ term as it was not relevant to his work (maximum entropy distributions), and it is somewhat awkward to have an infinite term in the calculation. Unfortunately, this cannot be helped if the quantization is made arbitrarily fine, as would be the case in the continuous limit. Note that $H(X)$ as defined here (without the $\log(N)$ term) is always non-positive, because a KL divergence is always non-negative.
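The invariance can be checked numerically under stated assumptions. The sketch below uses hypothetical densities $p(x)=3x^{2}$ and $m(x)=2x$ on $[0,1]$ and the change of variables $y=x^{2}$. Both densities are divided by $|dy/dx|=2x$, giving $p_{Y}(y)=1.5{\sqrt {y}}$ and $m_{Y}(y)=1$. The helper names and the midpoint-rule integrator are illustrative choices.

```python
import math

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule estimate of the integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx

def lddp_entropy(p, m, lo, hi):
    """Jaynes' H(X) = -∫ p log(p/m) dx, without the log(N) term."""
    return -integrate(lambda x: p(x) * math.log(p(x) / m(x)), lo, hi)

def differential_entropy(p, lo, hi):
    """Shannon's h(X) = -∫ p log p dx."""
    return -integrate(lambda x: p(x) * math.log(p(x)), lo, hi)

# Densities in the original variable x (assumed for this example).
p_x = lambda x: 3.0 * x**2
m_x = lambda x: 2.0 * x

# The same densities after y = x**2, each divided by |dy/dx| = 2x.
p_y = lambda y: 1.5 * math.sqrt(y)
m_y = lambda y: 1.0

H_x = lddp_entropy(p_x, m_x, 0.0, 1.0)
H_y = lddp_entropy(p_y, m_y, 0.0, 1.0)
h_x = differential_entropy(p_x, 0.0, 1.0)
h_y = differential_entropy(p_y, 0.0, 1.0)

print(H_x, H_y)  # agree up to integration error
print(h_x, h_y)  # differ: Shannon's formula is not invariant
```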

If it is the case that $m(x)$ is constant over some interval of size $r$, and $p(x)$ is essentially zero outside that interval, then the limiting density of discrete points (LDDP) is closely related to the differential entropy $h(X)$:

$$H_{N}(X)\approx \log(N)-\log(r)+h(X).$$
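The relation above can be checked in one line, assuming (as a normalization not stated explicitly in the text) that $m(x)=1/r$ on the interval, so that it integrates to one, and that $p(x)$ vanishes outside it:

$$H(X)=-\int p(x)\log {\frac {p(x)}{1/r}}\,dx=-\int p(x)\log p(x)\,dx-\log(r)\int p(x)\,dx=h(X)-\log(r).$$

Adding $\log(N)$, as in $H_{N}(X)\sim \log(N)+H(X)$, gives the stated approximation.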

References

1. Jaynes, E. T. (1963). "Information Theory and Statistical Mechanics". In K. Ford (ed.). Statistical Physics. Benjamin, New York. p. 181.
2. Jaynes, E. T. (1968). "Prior Probabilities". IEEE Transactions on Systems Science and Cybernetics. SSC-4 (3): 227–241. doi:10.1109/TSSC.1968.300117.

From Wikipedia, the free encyclopedia
