
Given random variables $X, Y, \ldots$ that are defined on a probability space, the joint probability distribution for $X, Y, \ldots$ is a probability distribution that gives the probability that each of $X, Y, \ldots$ falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.
The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables). These in turn can be used to find two other types of distributions: the marginal distribution giving the probabilities for any one of the variables with no reference to any specific ranges of values for the other variables, and the conditional probability distribution giving the probabilities for any subset of the variables conditional on particular values of the remaining variables.
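As a small illustration of how marginal and conditional distributions are obtained from a joint probability mass function, here is a minimal sketch. The joint table, the variable ranges, and the helper names `marginal_x` and `conditional_y_given_x` are all made up for this example; they are not taken from any particular source.

```python
from fractions import Fraction

# Hypothetical joint PMF of X in {0, 1} and Y in {0, 1, 2}.
# The probabilities are invented for illustration; they only need to sum to 1.
joint = {
    (0, 0): Fraction(1, 8), (0, 1): Fraction(1, 4), (0, 2): Fraction(1, 8),
    (1, 0): Fraction(1, 4), (1, 1): Fraction(1, 8), (1, 2): Fraction(1, 8),
}

def marginal_x(joint):
    """Marginal PMF of X: sum the joint PMF over all values of y."""
    out = {}
    for (x, _y), prob in joint.items():
        out[x] = out.get(x, Fraction(0)) + prob
    return out

def conditional_y_given_x(joint, x):
    """Conditional PMF of Y given X = x: joint PMF divided by the marginal of X."""
    px = marginal_x(joint)[x]
    return {y: prob / px for (xx, y), prob in joint.items() if xx == x}

p_x = marginal_x(joint)
p_y_given_x0 = conditional_y_given_x(joint, 0)
print("P(X = x):", p_x)
print("P(Y = y | X = 0):", p_y_given_x0)
```

Exact fractions are used instead of floats so that the sums and ratios can be compared without rounding error.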
✪ Lecture 19: Joint, Conditional, and Marginal Distributions  Statistics 110
Transcription
Okay? So last time we were talking about joint distributions. And just to kinda quickly remind everyone, the big theme right now is joint, conditional, and marginal distributions. And everyone needs to get comfortable with how all those concepts relate. So there's three different types of things: joint, conditional, and marginal. And we were talking about joint and marginal distributions last time, not so much about conditional distributions. But it's analogous to the stuff we've already seen about conditioning. So those are the three key words: joint, conditional, and marginal distributions. So at this point in the course we pretty much have all the tools we need for working with one random variable at a time. But there's much, much more that we need to study about what happens when we have two random variables, or a list, a sequence of random variables, things like that. A sum of a million random variables, and things like that. So we're gonna talk a lot about what happens with lots of random variables at the same time. And that's why I keep emphasizing that everything is cumulative here. Because if you have trouble with one random variable and its CDF, then understanding two of them at the same time is gonna be very difficult. So we always have a joint CDF. If there's two of them, I'll just write down what it looks like: $F(x,y) = P(X \le x, Y \le y)$. So the joint CDF would be this, for two random variables. But of course, if we had a million random variables instead of two, I'm not gonna write this down. I could write X1 through X a million, and then X1 less than or equal to little x1, and so on, a million of them. So this extends to as many as you want. But it's just easier to write it down and think about it for two of them. But it's more general than this. That's the joint CDF, and that always makes sense. They can be discrete, continuous, mixtures of discrete and continuous, or anything. In the continuous case, then, we have a joint PDF, which I talked a little bit about.
But I don't think I wrote down how to get from the joint CDF to the joint PDF. So then we have a joint PDF, and it's analogous to the one-dimensional case, where we take the derivative of the CDF to get the PDF. In this case, we take the derivative, except that it's a function of two variables. So we're gonna take two partial derivatives. And so I would write it as $f(x,y) = \frac{\partial^2}{\partial x\,\partial y} F(x,y)$. Which looks complicated, especially if you haven't seen partial derivatives. But even if you haven't ever done partial derivatives before, there's nothing really to worry about with this. All it means is take the derivative; this is a function of two variables. Take the derivative with respect to y, treating x as a constant, right? So if you can do derivatives, which I'm assuming you can do, you can pretend x is a constant. And then take the derivative with respect to x, holding y constant. And there's a theorem in multivariable calculus that says that under some mild conditions, it doesn't actually matter if you take the partial with respect to y and then with respect to x, or with respect to x and then with respect to y; you'll get the same thing. So this is again analogous to the one-dimensional case. And the joint PDF, this is not a probability, it's a density. It's what we integrate to get a probability. So integrate this. If we want to know the probability that (X, Y) is in some set A, then that's just gonna be the integral over that set A of the density: $P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy$. If you haven't done double integrals before, again, it's no big deal. Just integrate with respect to x, holding y constant, and then integrate with respect to y. Basically, the only complicated thing is figuring out the limits of integration. So I just wrote a double integral over A, cuz A could be any region in the plane. So if we had something like, if A is this blob, that may be a hard problem, to do this integral. What does it mean to integrate over the blob?
I mean, that turns into a nasty multivariable calculus problem. That's not something we care about for this course. It's just a nasty calculus problem. It's not an interesting probability problem. So the more interesting case for our purposes would be, let's call that blob A1, and down here's the A we actually want. If it's a rectangle, then this double integral just means integrate where x goes from here to here and y goes from here to here. So it's just literally the integral of the integral, so it's no different from doing two integrals. So you don't have to worry too much about the blobs. There's only one case where we might care about the blobs in this course, and that's when we have a uniform distribution over some region. So I'll come back to this. At the very end last time we were talking about a distribution that's uniform over a square or over a circle, that kinda thing. And in the uniform case, we can interpret probability as proportional to area. So in the uniform case probability is proportional to area. And then I could say, well, I'm just going to do something proportional to the area of the blob. And at least I can think more geometrically. But anyway, conceptually it's analogous. The joint PDF is what we integrate to get the probability of (X, Y) being in any particular set, right? In one dimension we'd say, what's the probability that X is between 3 and 5, right? We want an interval. And here we want the probability that it's in some region. But the rectangular case is gonna be the nicest one. So, those are joint distributions. And I talked a little bit last time about how to get the marginals, and it's very straightforward. Marginal PDF of X: to get the marginal PDF of X, we just integrate out the y. So we just integrate from minus infinity to infinity, $f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy$. Notice that by doing this, we'll get something that's now a function of x.
x is just treated as a constant here. We are integrating over all y, so this is no longer gonna depend on y; because you're integrating over all y, y becomes a dummy variable. Similarly, you get the marginal PDF of Y by integrating dx. And this is just completely analogous to summing over the cases in the discrete setting. It's just saying we want X to be this little x, and Y has to be something, so we just integrate over all possibilities. So that's the marginal. That's called marginalization: we marginalized out the y. We integrate out the y, we get the marginal of X. It's just terminology for something very simple. Just integrate. Now suppose we did a double integral. So if we then took this thing and integrated it dx, we should get 1. And one way to think of that is to say, if we let A be the entire plane, everything, we'd better get 1, right? Otherwise it wouldn't make any sense. The other way to think of it is, this is supposed to be the density of X, just viewed as X on its own. So if we integrate this dx, we have to get 1, otherwise it's not a valid marginal PDF. So that has to integrate to 1. May as well write that down just for emphasis: the double integral equals 1. And it's always minus infinity to infinity, minus infinity to infinity to start with. It might be that this is zero outside of some region, and then we could restrict it further. But we could always write it like this at first, and then we should be careful about where it is zero and where it is nonzero. So that's gonna be our marginal; let's do a conditional. Conditional distribution, so we want the conditional PDF. And this should be easy to understand and remember, because it's analogous to conditioning we've done before. So let's say we want the conditional PDF of Y given X. Well, we would just write that as f. Sometimes we'd put a subscript Y|X just for emphasis, and sometimes we may leave out the subscript, just cuz it's clear from the context.
The conditional PDF: just think of it as the PDF where we get to pretend that we know what X is. We get to observe what X is, okay? Given that information, that we now know the value of X, what is the appropriate PDF for Y? Well, we could think of that as being the joint density divided by the marginal density of X: $f_{Y|X}(y|x) = f_{X,Y}(x,y) / f_X(x)$. What I just wrote down looks just like the definition of conditional probability, right? The probability of this given this is the probability of this and this, divided by the probability of this thing. Now x and y are representing numbers, not events, okay? But it looks the same as the definition of conditional probability, and you can derive this from the definition of conditional probability. Basically what you would do is say our event is that capital Y equals little y. Or, if we are worried about probability zero, say Y is extremely close to y. That is, we let capital Y be in some tiny little interval around little y and find the conditional probability of that, given the value of X. And it's completely analogous to conditional probability. So this says that we can get the conditional just by taking the joint density divided by the marginal density. We could also do something that looks like Bayes' rule. That is, what if we want the conditional PDF of Y given X? Well, we want $f_{X|Y}(x|y)\, f_Y(y)$ over $f_X(x)$. I'm just writing down something that looks like Bayes' rule, right? I swapped the x and the y, but instead of probability I'm doing density, completely analogous to Bayes' rule. The proof is really: use Bayes' rule and then take a limit. And so this should be easy to remember. And the numerator is the same. Another way to say this is that to get the joint density, we can take one of the marginals times the other conditional, right? That's like, if we're pretending it's probability rather than density, it's like the probability of this y value times the probability of the x value given that y value.
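The density formulas described in words above can be collected in one place. This is a sketch of the standard identities, written with subscript notation that is assumed here rather than read off the board, and valid wherever the marginal densities in the denominators are positive:

```latex
\begin{align*}
f_{Y|X}(y \mid x) &= \frac{f_{X,Y}(x, y)}{f_X(x)}
  && \text{(conditional = joint / marginal)} \\
f_{Y|X}(y \mid x) &= \frac{f_{X|Y}(x \mid y)\, f_Y(y)}{f_X(x)}
  && \text{(Bayes' rule for densities)} \\
f_{X,Y}(x, y) &= f_{X|Y}(x \mid y)\, f_Y(y) = f_{Y|X}(y \mid x)\, f_X(x)
  && \text{(joint = marginal} \times \text{conditional)}
\end{align*}
```

The first two lines share the same numerator, since the third line says the joint density equals either marginal times the corresponding conditional.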
So everything is analogous to Bayes' rule in the discrete case. All right, so those are just the basic concepts we need for that. And I should mention again how to think of independence; again, this is the continuous case. X and Y are independent if, well, the general definition is in terms of CDFs, but it's usually easier to work with the PDF than the CDF. So usually, the best way to think of it is that independent means that the joint PDF is the product of the marginal PDFs, $f(x,y) = f_X(x)\, f_Y(y)$. And that has to hold for all x and y. It's not too hard to show that that's equivalent to having the CDFs factor. Cuz basically, if the CDFs factor, you could take the derivative, this mixed partial derivative thing, and you'll get this. Or you could take this thing and integrate, and go back there, so it's basically equivalent. Intuitively, it should be equivalent. All right, so let's come back to this uniform example. We wrote down the joint PDF last time; I'll remind you. That is, we have the distribution that was uniform on a circle, or inside the disk. So uniform in the disk, which is x squared plus y squared less than or equal to 1. We are picking a uniformly random point, maybe there. Uniform means that the probability of some region is proportional to area, okay? So one nice thing when we have problems that involve a uniform distribution on some region in the plane is that we can actually think of probability in terms of area, or at least it's proportional to area. So the joint PDF we did last time is just 1 over pi. It's one over the area, because that'll make it integrate to 1. That's within the circle, x squared plus y squared less than or equal to 1, and 0 outside, okay? But just for practice, let's get the marginal density of X and then the conditional density, X given Y or Y given X. And by the way, this may look like they're independent, because this looks like somehow it factors as a constant times a constant. But X and Y are not independent here, right?
Because if x is very close to 1, then it's constraining the values of y, so they're definitely dependent. You have to be careful about things like that. Cuz if you only look at the 1 over pi, it looks like they might be independent. But the key thing is that they are constrained together to be in the disk, right? So this is saying that X and Y are actually closely related. But if you only look at this part and ignore this part, you might think they're independent. All right, so let's get the marginal, $f_X(x)$. All we have to do is integrate out the y. So we're gonna integrate the joint PDF, which is 1 over pi, dy. The only thing we have to be careful about is the limits of integration. This is only valid when x squared plus y squared is less than or equal to 1, which is the same thing as saying that y squared is less than or equal to 1 minus x squared. And that tells us that y has to be between minus the square root of this and plus the square root of this. So we're gonna integrate from minus the square root of 1 minus x squared to the square root of 1 minus x squared. So the main mistake with this kind of problem is messing up the limits of integration somehow. We have to be very, very careful with limits of integration. You're not actually ever gonna have to do any difficult integral in this course. But sometimes you have to think carefully about the limits of integration, okay? So this is just saying, these are the bounds on y for which I should have 1 over pi rather than 0 here, okay? So if we get the limits of integration wrong, then it's just completely wrong. All right, this is a very easy integral. The integral of a constant is just the constant times the length of the interval. So that's just $f_X(x) = \frac{2}{\pi}\sqrt{1 - x^2}$, and that's valid for $-1 \le x \le 1$. As a check, we could integrate this thing dx. How do you actually integrate the square root of one minus x squared? You would do a trig substitution.
I'm not gonna do that integral right now, but you could integrate this thing from minus one to one, using a trig substitution as he just suggested. That's basically gonna reduce it back down to the fact that it's based on a circle, and you'll get one. So that does integrate to one. So that's the marginal. Notice that this does not look like a uniform, so it's certainly false to say that X is uniform between minus one and one. The point (X, Y) is uniform, but the marginals are not uniform, right? And in fact you can see that this is largest when x is 0, which kinda makes sense. Cuz if you imagine the random point here, then near the center it seems like there's more space for stuff to happen, and it seems a little less likely to be further out, okay? So let's get the conditional PDF now. All right, so we can either do Y given X or X given Y, whichever we feel like. Notice that if you want the marginal PDF of Y, just change the letter x to y here; by symmetry there's no need to repeat the same calculation. Okay, so let's do the PDF of Y given X. So that's just gonna be the joint PDF divided by the marginal PDF of X. So it's just gonna be 1 over pi, divided by 2 over pi times the square root of 1 minus x squared. I just took the joint PDF divided by the marginal PDF, and we have to be careful about where this is nonzero. I'm thinking of x as fixed right now; it's like we get to observe x, and I wanna say, well, what are the possible values of y? Well, for each x, it's the same thing again: y has to be between minus root 1 minus x squared and root 1 minus x squared, okay, and it's 0 otherwise. So the pi's cancel, and we get $f_{Y|X}(y|x) = \frac{1}{2\sqrt{1 - x^2}}$. That looks kind of ugly. What would be another way to say what this conditional density is? You're treating x as a constant. What would you call this thing? Uniform, because notice this only has an x here; there's no y. In general you would have a y here. There's no letter y on the right side of this equation.
So a nicer way to write this would be to say that Y given X is uniform between minus root 1 minus X squared and root 1 minus X squared, because this is just a constant for each fixed x. I wrote this with capital X here to clarify this notation. When you see this thing like Y given capital X, what does that mean? Intuitively that means just pretend that capital X, which we know is a random variable, is a known constant, cuz we got to observe it. But you can just think of this as shorthand for saying Y given X = x. This is kind of a more direct way to write it, that is, we get to observe that X = x. And we're saying that if we know that, then we have a uniform distribution between minus the square root of 1 minus x squared and the square root of 1 minus x squared. But it's a little more cumbersome to write it this way. So sometimes I'll write it this way with capital X, but just treat that as shorthand for this. That just means given that we get to know what X is, here's the distribution for Y. So we're treating X as a constant here, and here we're explicitly calling that constant little x; it's just notation. So okay, so that says it's conditionally uniform over some interval. Notice that that's the appropriate interval, cuz as soon as you specify what x is, we know y has to be between here and here. This says it's uniform. So similarly you could do f of x given y. And you can see that they're not independent. One way to see it is that the joint PDF f(x, y) does not equal the product of the marginal PDFs in general here, right? Take this thing and then the same thing with y; you multiply them and you do not get the joint PDF. So they're not independent. Another way to say they're not independent is that the conditional distribution of Y given X is not the same thing as the unconditional distribution of Y. That is, learning X gives us information, okay? All right, so those are these basic concepts: joint, conditional, marginal. I wanted to mention one more thing that's analogous to the one-dimensional case.
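The uniform-disk example can also be checked by simulation. The following is a rough sketch, not part of the lecture: the rejection sampler, the seed, the sample size, and the slice around x0 = 0.6 are all arbitrary choices made for illustration. It compares a simulated probability with the one implied by the marginal density (2/π)·sqrt(1 − x²), and checks that Y restricted to a thin vertical slice behaves like a uniform on (−sqrt(1 − x0²), sqrt(1 − x0²)).

```python
import math
import random

def sample_disk(rng):
    """Rejection sampling: draw from the square [-1, 1]^2 until the point lies in the unit disk."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

rng = random.Random(0)  # fixed seed (arbitrary) so the run is repeatable
points = [sample_disk(rng) for _ in range(200_000)]

# Marginal of X: P(|X| <= a) from integrating (2/pi) * sqrt(1 - x^2) over [-a, a],
# using the antiderivative (x * sqrt(1 - x^2) + asin(x)) / 2 of sqrt(1 - x^2).
a = 0.1
exact_central = (2 / math.pi) * (a * math.sqrt(1 - a * a) + math.asin(a))
est_central = sum(abs(x) <= a for x, _ in points) / len(points)

# Conditional of Y given X near x0: should be close to Uniform(-h, h), h = sqrt(1 - x0^2).
x0, width = 0.6, 0.02
h = math.sqrt(1 - x0 * x0)
ys = [y for x, y in points if abs(x - x0) < width]
mean_abs_y = sum(abs(y) for y in ys) / len(ys)  # a Uniform(-h, h) variable has E|Y| = h / 2

print(f"P(|X| <= {a}): simulated {est_central:.4f}, from marginal PDF {exact_central:.4f}")
print(f"E|Y| given X near {x0}: simulated {mean_abs_y:.3f}, uniform prediction {h / 2:.3f}")
```

If X were uniform on (−1, 1), the first probability would be 0.1; the marginal density gives a larger value because the density peaks at x = 0.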
And that's what I call the 2D LOTUS. And it's completely analogous. So we wanna do LOTUS where we have a function of more than one variable. So, let's let (X, Y) have a joint PDF. I'll state it in the continuous case, but you could also do a discrete 2D LOTUS if you want. So we have a joint PDF, f(x, y), okay, and then just let g be any function of x and y. Let's say it's real-valued. So this function g takes two values as input and outputs one value. For example, it could just be x plus y, or it could be x squared times sine of xy cubed, or whatever. Just any function of x and y, okay? A real-valued function of x and y. And then we're gonna write down LOTUS. LOTUS tells us how to get the expected value of g(X, Y), and it says we do not need to try to find the PDF of g(X, Y); we can work directly in terms of the joint PDF. And all we have to do is integrate: $E[g(X,Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\, f(x,y)\,dx\,dy$. It's gonna be minus infinity to infinity, minus infinity to infinity, but possibly we can narrow down that range. I just change capital X, Y to lowercase x, y, and then I use the joint PDF. Completely analogous. So let's do a couple of examples: how is this fact useful? So here is an important fact. I already needed this fact once and we didn't prove it yet, which was when we were talking about the fact that the MGF of a sum of independent random variables is the product of the MGFs. And at some point we need to say E of something times something is E of something times E of the other thing. That's true when they're independent, and that's what we need to show right now. So the theorem is that if X and Y are independent, then E of XY equals E of X times E of Y. That's a very useful fact. Well, we'll come back to this fact later when we talk about correlation. The way we would say this in words is that independent implies uncorrelated; that's just foreshadowing. Later we'll talk more about what exactly correlation means.
But when we define correlation, in a later lecture, we'll see that this actually says that they're uncorrelated. And so "independent implies uncorrelated" is the way to say it in words. So let's prove this fact. And this is always true. It doesn't matter if they're continuous or discrete or whatever. But so we don't have to invent a lot of notation or do a lot of cases, let's just do the continuous case for practice. So, proof in the continuous case. Well, we're just gonna use the 2D LOTUS. That saves us a lot of effort, because when you just see this thing, E of XY, X times Y is a random variable in its own right. So the first time you see this, you might think, I need to study that random variable. That takes a lot of work. 2D LOTUS just says that's a function of X and Y; I'm just gonna use LOTUS and then it's gonna be easy. So E of XY equals, how do we do this? Well, I'll just write down the double integral, minus infinity to infinity, minus infinity to infinity, of xy times the joint PDF. But since we assumed that they're independent, the joint PDF is just the product of the marginal PDFs. So independence means the joint PDF just factors like that: $E(XY) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f_X(x)\, f_Y(y)\,dx\,dy$. And that's what makes this actually easy to deal with. Because this function is just separated out: this is a function of x times a function of y. Function of x, function of y. Very nice. So, now, what do we actually do? Well, what this says to do is take this. I'll put parentheses here to make it a little clearer what this double integral means. That's just the definition of this double integral. It says do this inner integral, then do this outer integral, so you work your way out. When you're doing this inner integral you're treating y as a constant, so this y, you can take it out and stick it right there. And this $f_Y(y)$, also stick that there. Both of those come out. So that looks a little messy; let's rewrite that. All I did was to take out the y and the marginal PDF of Y. And what's left is the x and the marginal PDF of X.
So I just took them out. Now this whole thing here, that's just a number. That just says we took this function and we integrated it over x and we get a number. That's just a constant. And we know what constant that is: that's E of X. That's just a number. So that constant you can pull out of this entire integral. It's just a constant; take it out of the integral. What's left? The integral of y times the PDF of Y. That's just E of Y. So that's immediately just E of X times E of Y. All this amounts to doing is just taking out things that you're treating as constants, and then it factors. So that's a useful fact. And it would be a nightmare to try to prove this without having LOTUS available. But with LOTUS we can do that pretty quickly. All right, there's another problem I like to do with the 2D LOTUS, and that's the expected distance between two points. So let's start with the uniform case. I talked about this on the strategic practice too; you can look at that later. But I think it's useful for everyone to see this now. So this is an example where we take two uniforms; let's let them be X and Y, and let them be i.i.d. Uniform(0, 1). And we wanna find the expected distance between them. So, this kind of problem comes up a lot in applications where you have two random points, and often you wanna know how far apart they are. So this is used for various applications. And so one approach would be to try to study the absolute value of X minus Y, find its distribution, and maybe for some problems we need the distribution. But in this case, we said we only want the mean, so LOTUS should suffice for that. So just write down LOTUS. So it's a double integral of the absolute value of x minus y. And since they're i.i.d. uniform, the PDF is just 1; the joint PDF is just 1. So you just have to integrate this thing, dx dy, from 0 to 1, 0 to 1: $E|X - Y| = \int_0^1\int_0^1 |x - y|\,dx\,dy$. So then the only question, I guess, is how do we integrate the absolute value?
Well, usually if you wanna integrate an absolute value, the best strategy would be to split the integral into pieces such that you can get rid of the absolute value. So we could split this up as one piece where x is greater than y. I'll write it this way: x greater than y. That is, I'm integrating this function over the set of all points in the unit square where x is greater than y. Now if x is greater than y, I can just drop the absolute value. Plus, now integrate over the piece where x is less than or equal to y. And in that case, it's y minus x, not x minus y. Now if you think of the symmetry of the problem, this problem is completely symmetrical because of the i.i.d. assumption. And this is a symmetrical function; I could have changed this to y minus x. So really there is no point in doing two double integrals; let's just do one double integral and double it. And then we have to do two integrals instead of four, so that's much nicer. This is just gonna be 2 times the first integral I wrote down. Okay, I am not gonna do a lot of double integrals in class, and you won't have to do many double integrals in general in this course. We will have to do a couple of them. So just for practice, let's do this. Basically, the only thing you could mess up is the limits of integration. So let's carefully say, how do we get the correct limits of integration here? The outer limits: I could've done dy dx, and then it'd be different limits of integration. Okay, but I chose, for no particular reason, to just write it as dx dy. So if we write dx dy, the outer limits must refer to y. And we know y goes from 0 to 1. Okay, now the inner limits. The outer limits have to just be numbers. But as you move inward, the limits can start depending on other variables, so these inner limits can depend on y. In fact, they have to depend on y. Okay, it would not work to go 0 to 1 here. We know x has to be between 0 and 1.
But we also know that we're only integrating over x greater than y, so x has to be greater than y. So we go from y to 1. Right, because x is bigger than y, so it has to start at y. So that's all we have to do. We do have to be careful; it's easy to mess up the limits of integration. Now this just says do 2 easy integrals, okay? So it's 2 times the integral from 0 to 1 of this inner integral. For the inner integral, I integrate x minus y dx, so I'm treating y as a constant, so I would just get x squared over 2 minus yx. All right, I'm treating y as a constant, and then evaluate this from y to 1. Okay, and so then we just plug in 1 and subtract what we get when we plug in y. And then it's just a very, very easy integral. And I won't bore you with all the algebra for that. You just plug in 1, plug in y, and then you're integrating a very easy integral. If you simplify that you get one-third. So the average distance between two uniforms is one-third. Let's draw a little picture to see whether that makes intuitive sense to us. So we have this interval 0 to 1, okay? And we're picking 2 uniformly random points in this interval. Let's say there and there, completely random. But notice that the distance between them is one-third, because that's one-third, that's two-thirds, and the distance is one-third. That sort of looks like your stereotypical picture. If you had to guess what it would look like, that might be what you would guess, right, and it works. For me at least, this actually makes the result one-third easy to remember, even though that's not a proof, obviously. >> [LAUGH] >> But that actually does suggest another way to look at this problem. Which is, I'm picking these two random points, and there's gonna be a point on the left and a point on the right. So that suggests reinterpreting this in terms of the max and the min. So another way to look at this would be to let, let's say, M = maximum of X, Y. And you should think through for yourself why the maximum of random variables is itself a random variable.
That's just basic practice with your random variables. And L: this is something that always annoys me, that the word maximum and the word minimum both start with m, so it's hard to remember your notation. So I started using L for the minimum, because L stands for the least one, or L stands for the little one. But unfortunately, then I realized L could also stand for the large one. >> [LAUGH] >> It's just one annoying fact about English. >> [LAUGH] >> Well, anyway, we'll let M be the max and L be the min. Here's a handy fact: the absolute value of X minus Y is the same thing as M minus L, right? Because you take the bigger one minus the smaller one, and that's the same thing as taking the absolute difference. That's how you do an absolute value: you just take the bigger one minus the smaller one. So therefore, what we've just shown is that E of M minus L equals one-third, according to that calculation. So by linearity that says that E of M minus E of L equals one-third. And on the other hand, sorry, I should have written this up higher. I'm gonna loop around to the top here. So the difference of the expectations is one-third. Let's also look at the sum. If we look at E of M + L, well, by linearity, that's E of M plus E of L. But on the other hand, what's M + L in terms of X and Y? It's just X + Y, because if you add the bigger number plus the smaller number, all you've done is add the two numbers, right? M + L is the same thing as X + Y. But by linearity, E of X + Y is E of X plus E of Y, and both of those are one-half cuz they're Uniform(0, 1), so this must equal 1. We just showed that E of M plus E of L equals 1. So from this, we actually now have an expression for the sum of these two expectations and for the difference. So therefore, we can just solve that and we get E of M and E of L. I have a system of two equations and two unknowns; just solve that the usual way, add the two equations, that kind of thing. So E of M equals two-thirds and E of L equals one-third, just like in this picture. That's L, that's M, on average.
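The three expectations derived above can be sanity-checked with a quick Monte Carlo run. This is only a sketch: the seed and sample size are arbitrary, and the variable names are chosen here for illustration.

```python
import random

rng = random.Random(1)  # fixed seed (arbitrary) so the run is repeatable
n = 200_000

total_dist = total_max = total_min = 0.0
for _ in range(n):
    x, y = rng.random(), rng.random()  # two i.i.d. Uniform(0, 1) draws
    total_dist += abs(x - y)
    total_max += max(x, y)
    total_min += min(x, y)

mean_dist = total_dist / n  # estimates E|X - Y|, derived above as 1/3
mean_max = total_max / n    # estimates E(M), derived above as 2/3
mean_min = total_min / n    # estimates E(L), derived above as 1/3

print(f"E|X - Y| ~ {mean_dist:.4f}, E(M) ~ {mean_max:.4f}, E(L) ~ {mean_min:.4f}")
```

Because |x − y| = max(x, y) − min(x, y) holds for every single draw, the simulated mean distance equals the difference of the simulated max and min means up to floating-point rounding, which mirrors the linearity argument.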
So on average, it looks like that. So I used this result to prove this result, but another approach to this problem would have been, let's directly study the max and the min, okay. And you've seen examples like this on the strategic practice, like the very useful fact that the minimum of independent exponentials is exponential with a larger rate. We showed that on the strategic practice problem. And on the new strategic practice, there's something related with the min and the max. So another way to do this would have been to directly find the PDF of M and the PDF of L, and that would give us this result, right. So you could go in either direction. I actually don't like doing double integrals; I'm not gonna do a lot of double integrals, and you won't have to do many. I felt I should do one for practice with the 2D LOTUS. Okay, but in general, I would rather think more in this way: use linearity, use the CDFs, things like that, and not do a lot of integrals. Okay, so those are continuous examples. I want to do one discrete example for the rest of today. And then next time, we'll also do some more discrete stuff and maybe some more continuous stuff too. This is one of my favorite discrete problems. I call it the chicken and egg problem. Chicken-egg. We already had a homework problem about chickens and eggs, and hatchings and so on. But it's not exactly the same as this problem; it's related. So here's the problem. I'll state the problem, then we'll solve the problem, and then we'll be done. >> [LAUGH] >> Okay, here's the problem. There are some eggs; some of them hatch, some of them don't hatch. The eggs are independent. Let's assume there are N eggs. The twist to this problem is that the number of eggs is random; a chicken doesn't always lay exactly the same number of eggs. So let's assume that it's Poisson(λ).
That's the number of eggs. Now each one either hatches or fails to hatch, so each hatches with probability p, and independently, so you can think of each egg as an independent Bernoulli p trial for whether it hatches or not. Hatching is success. So independently, and let X equal the number that hatch. So I would write that as X given N is Binomial(N, p). That's just a restatement, we already know this. As I explained, this notation means pretend that N is a known constant. Actually N is Poisson, but pretend that now we know the number of eggs. So we're treating N as a constant, then it's just Binomial(N, p) because I assumed independent Bernoulli trials. So still we know that. Okay, let's also let Y equal the number that don't hatch. So we have an identity X plus Y equals N. All right, well, that's not the end yet, but we derived the theorem that X plus Y equals N, because the number that hatch plus the number that don't hatch equals the number of eggs. Now the problem is to find the joint PMF. It's discrete so I could find the joint PMF of X and Y. And in particular we'd like to know, are they independent? And intuitively, they seem extremely dependent because their sum must equal N. That's not a proof though, because this only shows that they're conditionally dependent. That is, if we know N they are dependent, and intuitively they seem pretty dependent. That is, if you have a lot of eggs that hatch then there's not so many left that don't hatch. But we haven't yet proven whether they are independent or not, cuz this sum equals N, and N is random. So now let's find the joint PMF. So just by definition the joint PMF is the probability that X equals something, let's say i, Y equals j. I could use little x, little y, but I'm just using i and j to remind us that they are integers. Now, to do that, somehow we have to bring in this Poisson thing. So our strategy for solving this should be going back to the early part of the course.
You have a probability, if you don't immediately know how to do it, try to find something to condition on. What do you condition on? What we wish that we knew. I wish I knew the number of eggs. Then it's an easy binomial problem. Conditional on the number of eggs, it's just a binomial. So we're gonna condition on N. The law of total probability says we can just write this as the sum of the probability that X equals i, Y equals j, given that N equals n, times the probability that N equals n, summed over all n from zero to infinity. That's just the law of total probability. And the probability that N equals n, we already know that from the Poisson. Well, okay, that looks a little scary, like we're gonna have to do an infinite sum. For similar problems I've seen a lot of students get stuck at this point. And my suggestion, if you ever find yourself getting stuck at a point like this, is to try some simple examples, make up some numbers, do some special cases so you think about it more concretely rather than being intimidated by this infinite series. If you actually think about it concretely, you'll notice something very, very simple. That is, if I said, what is the probability of this, this is just kind of some scratch work. What's the probability that X equals 3, Y equals 5, given N equals 10? What's that? 0, because there's 10 eggs, 3 hatched, 5 didn't hatch, someone stole the other two eggs, I mean it doesn't make any sense, it's impossible, 0. What's the probability that X equals 3, Y equals 5, given N equals 2? 0. There's only two eggs and yet you're claiming three hatched and five didn't hatch, that makes no sense. So as soon as you write down a few simple numbers like that, it becomes completely obvious that this infinite sum is actually only one term. The one term is the case when in fact N equals i plus j. So we only have one term here. X equals i, Y equals j, given N equals i plus j. Otherwise there's a mismatch. Times the probability that N equals i plus j.
And now we know everything we need to know to just evaluate this. Notice there's now some redundancy, because if I know there's i plus j eggs and i hatched, I already know that j didn't hatch. You didn't have to tell me that. Redundant information, we just cross that out. Now X equals i given N equals i plus j, that's just from the binomial. Cuz given the value of N we're treating X as binomial, so we're just gonna take something from the binomial PMF times something from the Poisson PMF. And so let's see, that board is broken, so we can do this here still, just have a little more space. So we want to find the probability that X equals i, given N equals i plus j, times the probability that N equals i plus j. Okay, this is an easy calculation now, but let's see what the answer is. For the first term we just use the binomial. So i plus j choose i. I'll write that as i plus j over i factorial, j factorial. That's just i plus j choose i. Then... that's a factorial, thank you. So that's i plus j factorial, thanks, that's i plus j choose i. From the binomial. Times p to the i, because we're assuming Binomial(N, p). So p to the i, and q, as usual, q is one minus p. So i successes, j failures, q to the j, and then times the Poisson PMF, e to the minus lambda, lambda to the i plus j over i plus j factorial. Let's just simplify this quickly. The i plus j factorials cancel, and let's try to write this in a nicer looking form, where we are going to try to split it up into a function of i times a function of j. So lambda to the i plus j, we can split that up, so really we have lambda p to the i over i factorial. And we have lambda q to the j over j factorial. And the only thing left that we have to deal with is this e to the minus lambda. But remember that p plus q equals 1. So I can think of it as having a p plus q sitting up in the exponent. So this is e to the minus lambda p. And this is e to the minus lambda q.
So actually it factored, so that shows that they are independent. That says that X and Y are independent. And X is also Poisson: X is Poisson lambda p, and Y is Poisson lambda q. Which sounds impossible at first, how could they be independent? And if your intuition was that they're not independent, you shouldn't feel bad about that because it turns out that this is only true for the Poisson. So this is actually a very special property of the Poisson. If you change Poisson to anything else they will become dependent. It happens to be true for the Poisson, we just proved that they're independent. That is, you think, well, if you have more eggs that hatched, there are fewer that didn't hatch, but the number of eggs is random, and that randomness, exactly for the Poisson, makes them independent. Well, that's just an example of a joint PMF. It's also a nice story. And have a good weekend.
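The factorization derived in the lecture can be checked numerically. The following is a minimal sketch, assuming illustrative values lambda = 4 and p = 0.3 (these numbers and the helper names are not from the lecture). It compares the joint PMF, built from the single surviving term n = i + j, with the product of a Poisson(lambda p) and a Poisson(lambda q) PMF.

```python
import math

def poisson_pmf(k, lam):
    """P(K = k) for K ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(K = k) for K ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def joint_pmf(i, j, lam, p):
    """P(X = i, Y = j): only the n = i + j term of the total-probability sum is nonzero."""
    n = i + j
    return binomial_pmf(i, n, p) * poisson_pmf(n, lam)

# Assumed illustrative parameters (not from the lecture).
lam, p = 4.0, 0.3
q = 1 - p

# Largest gap between the joint PMF and the product of the two Poisson PMFs.
max_gap = max(
    abs(joint_pmf(i, j, lam, p) - poisson_pmf(i, lam * p) * poisson_pmf(j, lam * q))
    for i in range(15)
    for j in range(15)
)
print(max_gap)
```

The gap should be at the level of floating-point rounding, which is consistent with X and Y being independent Poisson variables.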
Examples
Draws from an urn
Suppose each of two urns contains twice as many red balls as blue balls, and no others, and suppose one ball is randomly selected from each urn, with the two draws independent of each other. Let $A$ and $B$ be discrete random variables associated with the outcomes of the draw from the first urn and second urn respectively. The probability of drawing a red ball from either of the urns is 2/3, and the probability of drawing a blue ball is 1/3. We can present the joint probability distribution as the following table:
          A=Red            A=Blue           P(B)
B=Red     (2/3)(2/3)=4/9   (1/3)(2/3)=2/9   4/9+2/9=2/3
B=Blue    (2/3)(1/3)=2/9   (1/3)(1/3)=1/9   2/9+1/9=1/3
P(A)      4/9+2/9=2/3      2/9+1/9=1/3
Each of the four inner cells shows the probability of a particular combination of results from the two draws; these probabilities are the joint distribution. In any one cell the probability of a particular combination occurring is (since the draws are independent) the product of the probability of the specified result for $A$ and the probability of the specified result for $B$. The probabilities in these four cells sum to 1, as is always true for probability distributions.
Moreover, the final row and the final column give the marginal probability distribution for $A$ and the marginal probability distribution for $B$ respectively. For example, for $A$ the first of these cells gives the sum of the probabilities for $A$ being red, regardless of which possibility for $B$ in the column above the cell occurs, as 2/3. Thus the marginal probability distribution for $A$ gives $A$'s probabilities unconditional on $B$, in a margin of the table.
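The urn table can be reproduced with exact fractions. This is a minimal sketch; the dictionary layout and variable names are illustrative choices, not part of the article.

```python
from fractions import Fraction
from itertools import product

# Probability of each colour in a single draw (same for both urns).
p_color = {"Red": Fraction(2, 3), "Blue": Fraction(1, 3)}

# The draws are independent, so each joint cell is a product of single-draw probabilities.
joint = {(a, b): p_color[a] * p_color[b] for a, b in product(p_color, repeat=2)}

# Marginals: sum the joint table over the other variable.
marginal_a = {a: sum(joint[(a, b)] for b in p_color) for a in p_color}
marginal_b = {b: sum(joint[(a, b)] for a in p_color) for b in p_color}

for (a, b), p in joint.items():
    print(f"P(A={a}, B={b}) = {p}")
print("P(A):", marginal_a)
print("P(B):", marginal_b)
```

Summing a row or column of the joint table recovers the single-draw probabilities 2/3 and 1/3, as in the margins of the table.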
Coin flips
Consider the flip of two fair coins; let $A$ and $B$ be discrete random variables associated with the outcomes of the first and second coin flips respectively. Each coin flip is a Bernoulli trial and has a Bernoulli distribution. If a coin displays "heads" then the associated random variable takes the value 1, and it takes the value 0 otherwise. The probability of each of these outcomes is 1/2, so the marginal (unconditional) probability mass functions are
$$P(A) = 1/2 \quad \text{for } A \in \{0, 1\};$$
$$P(B) = 1/2 \quad \text{for } B \in \{0, 1\}.$$
The joint probability mass function of $A$ and $B$ defines probabilities for each pair of outcomes. All possible outcomes are
$$(A=0, B=0),\ (A=0, B=1),\ (A=1, B=0),\ (A=1, B=1).$$
Since each outcome is equally likely, the joint probability mass function becomes
$$P(A, B) = 1/4 \quad \text{for } A, B \in \{0, 1\}.$$
Since the coin flips are independent, the joint probability mass function is the product of the marginals:
$$P(A, B) = P(A)\,P(B) \quad \text{for } A, B \in \{0, 1\}.$$
Roll of a die
Consider the roll of a fair die and let $A=1$ if the number is even (i.e. 2, 4, or 6) and $A=0$ otherwise. Furthermore, let $B=1$ if the number is prime (i.e. 2, 3, or 5) and $B=0$ otherwise.
Roll   1  2  3  4  5  6
A      0  1  0  1  0  1
B      0  1  1  0  1  0
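Then, the joint distribution of $A$ and $B$, expressed as a probability mass function, is
$$P(A=0, B=0) = P\{1\} = \frac{1}{6}, \quad P(A=1, B=0) = P\{4, 6\} = \frac{2}{6},$$
$$P(A=0, B=1) = P\{3, 5\} = \frac{2}{6}, \quad P(A=1, B=1) = P\{2\} = \frac{1}{6}.$$
These probabilities necessarily sum to 1, since the probability of some combination of $A$ and $B$ occurring is 1.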
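The joint probability mass function of the die example can be obtained by enumerating the six equally likely faces. A minimal sketch follows; the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

primes = {2, 3, 5}

# Accumulate probability 1/6 per face into the cell (a, b) it maps to.
joint = defaultdict(Fraction)  # missing cells start at Fraction(0)
for face in range(1, 7):
    a = 1 if face % 2 == 0 else 0   # A = 1 when the face is even
    b = 1 if face in primes else 0  # B = 1 when the face is prime
    joint[(a, b)] += Fraction(1, 6)

for cell in sorted(joint):
    print(cell, joint[cell])
print("total:", sum(joint.values()))
```

Fractions are printed in lowest terms, so the cells with probability 2/6 appear as 1/3.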
Bivariate normal distribution
The multivariate normal distribution, which is a continuous distribution, is the most commonly encountered distribution in statistics. When there are specifically two random variables, this is the bivariate normal distribution. Its density can be drawn as a surface, with the possible values of the two variables plotted in two of the dimensions and the value of the density function for any pair of such values plotted in the third dimension. The probability that the two variables together fall in any region of their two dimensions is given by the volume under the density function above that region.
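The "volume under the density" statement can be illustrated numerically. The sketch below assumes a standard bivariate normal (zero means, unit variances, correlation rho), whose density formula is standard but is not given in the article, and uses a simple midpoint rule over a rectangle; the function names and grid size are illustrative choices.

```python
import math

def bivariate_normal_pdf(x, y, rho):
    """Density of a standard bivariate normal with correlation rho (|rho| < 1)."""
    det = 1.0 - rho * rho
    z = (x * x - 2.0 * rho * x * y + y * y) / det
    return math.exp(-0.5 * z) / (2.0 * math.pi * math.sqrt(det))

def rectangle_probability(x_lo, x_hi, y_lo, y_hi, rho, steps=200):
    """Midpoint-rule estimate of the volume under the density above a rectangle."""
    dx = (x_hi - x_lo) / steps
    dy = (y_hi - y_lo) / steps
    total = 0.0
    for i in range(steps):
        x = x_lo + (i + 0.5) * dx
        for j in range(steps):
            y = y_lo + (j + 0.5) * dy
            total += bivariate_normal_pdf(x, y, rho)
    return total * dx * dy

# Probability that both variables fall within one standard deviation of their means.
prob_independent = rectangle_probability(-1, 1, -1, 1, rho=0.0)
prob_correlated = rectangle_probability(-1, 1, -1, 1, rho=0.8)
print(prob_independent, prob_correlated)
```

With rho = 0 the estimate should agree with the square of the one-dimensional probability of falling within one standard deviation, since the variables are then independent.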
Joint cumulative distribution function
For a pair of random variables $X, Y$, the joint cumulative distribution function (CDF) $F_{X,Y}$ is given by [1, p. 89]
$$F_{X,Y}(x, y) = \operatorname{P}(X \leq x, Y \leq y) \qquad \text{(Eq.1)}$$
where the right-hand side represents the probability that the random variable $X$ takes on a value less than or equal to $x$ and that $Y$ takes on a value less than or equal to $y$.
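For $N$ random variables $X_1, \ldots, X_N$, the joint CDF $F_{X_1, \ldots, X_N}$ is given by
$$F_{X_1, \ldots, X_N}(x_1, \ldots, x_N) = \operatorname{P}(X_1 \leq x_1, \ldots, X_N \leq x_N) \qquad \text{(Eq.2)}$$
Interpreting the $N$ random variables as a random vector $\mathbf{X} = (X_1, \ldots, X_N)^T$ yields a shorter notation:
$$F_{\mathbf{X}}(\mathbf{x}) = \operatorname{P}(X_1 \leq x_1, \ldots, X_N \leq x_N)$$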
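For discrete variables the joint CDF is a sum of joint probabilities over all cells at or below the given point. The sketch below assumes the joint distribution of the die-roll example earlier in the article (read off from its table); the function name is illustrative.

```python
from fractions import Fraction

# Joint PMF of the die-roll example, keyed by (a, b).
joint_pmf = {
    (0, 0): Fraction(1, 6),
    (1, 0): Fraction(2, 6),
    (0, 1): Fraction(2, 6),
    (1, 1): Fraction(1, 6),
}

def joint_cdf(x, y, pmf=joint_pmf):
    """F(x, y) = P(A <= x, B <= y), summing every cell with a <= x and b <= y."""
    return sum((p for (a, b), p in pmf.items() if a <= x and b <= y), Fraction(0))

print(joint_cdf(0, 0))  # only the cell (0, 0)
print(joint_cdf(1, 0))  # all cells with b = 0
print(joint_cdf(1, 1))  # every cell
```

Evaluating the CDF at the largest values of both variables gives 1, and evaluating it below the smallest value of either variable gives 0.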
Joint density function or mass function
Discrete case
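The joint probability mass function of two discrete random variables $X, Y$ is:
$$p_{X,Y}(x, y) = \mathrm{P}(X = x \ \text{and}\ Y = y) \qquad \text{(Eq.3)}$$
or written in terms of conditional distributions
$$p_{X,Y}(x, y) = \mathrm{P}(Y = y \mid X = x) \cdot \mathrm{P}(X = x) = \mathrm{P}(X = x \mid Y = y) \cdot \mathrm{P}(Y = y)$$
where $\mathrm{P}(Y = y \mid X = x)$ is the probability of $Y = y$ given that $X = x$.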
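The generalization of the preceding two-variable case is the joint probability distribution of $n$ discrete random variables $X_1, X_2, \dots, X_n$ which is:
$$p_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \mathrm{P}(X_1 = x_1 \ \text{and}\ \dots \ \text{and}\ X_n = x_n) \qquad \text{(Eq.4)}$$
or equivalently
$$p_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \mathrm{P}(X_1 = x_1) \cdot \mathrm{P}(X_2 = x_2 \mid X_1 = x_1) \cdot \mathrm{P}(X_3 = x_3 \mid X_1 = x_1, X_2 = x_2) \cdots \mathrm{P}(X_n = x_n \mid X_1 = x_1, X_2 = x_2, \dots, X_{n-1} = x_{n-1}).$$
This identity is known as the chain rule of probability.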
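Since these are probabilities, we have in the two-variable case
$$\sum_x \sum_y \mathrm{P}(X = x \ \text{and}\ Y = y) = 1,$$
which generalizes for $n$ discrete random variables $X_1, X_2, \dots, X_n$ to
$$\sum_i \sum_j \dots \sum_k \mathrm{P}(X_1 = x_{1i}, X_2 = x_{2j}, \dots, X_n = x_{nk}) = 1.$$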
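The chain rule and the normalization condition can be illustrated with a small example. The sketch below assumes a hypothetical urn with 3 red and 2 blue balls from which three balls are drawn without replacement (this setup is not from the article); each factor in the product is the conditional probability of the next colour given the earlier draws.

```python
from fractions import Fraction
from itertools import product

def sequence_probability(colors, red=3, blue=2):
    """P(X1 = c1, ..., Xn = cn) for draws without replacement, built by the chain rule."""
    remaining = {"R": red, "B": blue}
    prob = Fraction(1)
    for c in colors:
        if remaining[c] == 0:
            return Fraction(0)  # that colour is exhausted, so the sequence is impossible
        total = remaining["R"] + remaining["B"]
        prob *= Fraction(remaining[c], total)  # P(X_k = c | earlier draws)
        remaining[c] -= 1
    return prob

# Joint PMF over all colour sequences of length 3.
joint = {seq: sequence_probability(seq) for seq in product("RB", repeat=3)}

for seq, p in joint.items():
    print("".join(seq), p)
print("total:", sum(joint.values()))
```

The probabilities over all eight sequences sum to 1, and the sequence of three blue balls has probability 0 because the assumed urn holds only two.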
Continuous case
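The joint probability density function $f_{X,Y}(x, y)$ for two continuous random variables is defined as the mixed partial derivative of the joint cumulative distribution function (see Eq.1):
$$f_{X,Y}(x, y) = \frac{\partial^2 F_{X,Y}(x, y)}{\partial x \, \partial y} \qquad \text{(Eq.5)}$$
This is equal to:
$$f_{X,Y}(x, y) = f_{Y \mid X}(y \mid x) \, f_X(x) = f_{X \mid Y}(x \mid y) \, f_Y(y)$$
where $f_{Y \mid X}(y \mid x)$ and $f_{X \mid Y}(x \mid y)$ are the conditional distributions of $Y$ given $X = x$ and of $X$ given $Y = y$ respectively, and $f_X(x)$ and $f_Y(y)$ are the marginal distributions for $X$ and $Y$ respectively.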
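The definition extends naturally to more than two random variables:
$$f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \frac{\partial^n F_{X_1, \ldots, X_n}(x_1, \ldots, x_n)}{\partial x_1 \ldots \partial x_n} \qquad \text{(Eq.6)}$$
Again, since these are probability distributions, one has
$$\int_x \int_y f_{X,Y}(x, y) \, dy \, dx = 1$$
respectively
$$\int_{x_1} \ldots \int_{x_n} f_{X_1, \ldots, X_n}(x_1, \ldots, x_n) \, dx_n \ldots dx_1 = 1$$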
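The relation between the joint CDF and the joint density in Eq.5 can be checked by finite differences. The sketch below assumes, purely for illustration, two independent Exponential(1) variables, for which the joint CDF and density are known in closed form; the step size and function names are arbitrary choices.

```python
import math

def joint_cdf(x, y):
    """Assumed example: independent Exponential(1) variables, valid for x, y >= 0."""
    return (1.0 - math.exp(-x)) * (1.0 - math.exp(-y))

def joint_pdf_exact(x, y):
    """Closed-form density of the assumed example."""
    return math.exp(-(x + y))

def mixed_partial(F, x, y, h=1e-3):
    """Central finite-difference estimate of the mixed partial derivative of F."""
    return (F(x + h, y + h) - F(x + h, y - h)
            - F(x - h, y + h) + F(x - h, y - h)) / (4.0 * h * h)

estimate = mixed_partial(joint_cdf, 1.0, 0.5)
exact = joint_pdf_exact(1.0, 0.5)
print(estimate, exact)
```

The two printed values should agree to several decimal places; the remaining gap comes from the finite step size.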
Mixed case
The "mixed joint density" may be defined where one or more random variables are continuous and the other random variables are discrete. With one variable of each type we have
$$f_{X,Y}(x, y) = f_{X \mid Y}(x \mid y) \, \mathrm{P}(Y = y) = \mathrm{P}(Y = y \mid X = x) \, f_X(x).$$
One situation in which one may wish to find the cumulative distribution of a continuous random variable together with a discrete random variable arises when logistic regression is used to predict the probability of a binary outcome $Y$ conditional on the value of a continuously distributed outcome $X$. One must use the "mixed" joint density when finding the cumulative distribution of this binary outcome because the pair $(X, Y)$ cannot collectively be assigned either a probability density function or a probability mass function. Formally, $f_{X,Y}(x, y)$ is the probability density function of $(X, Y)$ with respect to the product measure on the respective supports of $X$ and $Y$. Either of these two decompositions can then be used to recover the joint cumulative distribution function:
$$F_{X,Y}(x, y) = \sum_{t \leq y} \int_{s = -\infty}^{x} f_{X,Y}(s, t) \, ds.$$
The definition generalizes to a mixture of arbitrary numbers of discrete and continuous random variables.
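A mixed joint density and its CDF can be sketched numerically. The example below assumes a standard normal $X$ and a logistic model P(Y=1 | X=x) = 1/(1 + exp(-x)) with unit slope and zero intercept; these modelling choices, the truncation of the integral at -8, and the function names are illustrative assumptions rather than part of the article.

```python
import math

def normal_pdf(x):
    """Density of the assumed standard normal X."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def prob_y_given_x(y, x):
    """Assumed logistic model for the binary outcome Y in {0, 1}."""
    p1 = 1.0 / (1.0 + math.exp(-x))
    return p1 if y == 1 else 1.0 - p1

def mixed_density(x, y):
    """f(x, y) = P(Y = y | X = x) * f_X(x)."""
    return prob_y_given_x(y, x) * normal_pdf(x)

def joint_cdf(x, y, lower=-8.0, steps=4000):
    """F(x, y): sum over t <= y of the integral of f(s, t) over s <= x (midpoint rule)."""
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    total = 0.0
    for t in (0, 1):
        if t <= y:
            total += h * sum(mixed_density(lower + (k + 0.5) * h, t) for k in range(steps))
    return total

print(joint_cdf(8.0, 1))  # close to 1: essentially all of the probability
print(joint_cdf(8.0, 0))  # close to P(Y = 0)
print(joint_cdf(0.0, 1))  # close to P(X <= 0)
```

Under these symmetric assumptions both P(Y = 0) and P(X <= 0) equal one half, which gives a simple sanity check on the numerical integration.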
Additional properties
Joint distribution for independent variables
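In general two random variables $X$ and $Y$ are independent if the joint cumulative distribution function satisfies
$$F_{X,Y}(x, y) = F_X(x) \cdot F_Y(y)$$
Two discrete random variables $X$ and $Y$ are independent if the joint probability mass function satisfies
$$\mathrm{P}(X = x \ \text{and}\ Y = y) = \mathrm{P}(X = x) \cdot \mathrm{P}(Y = y)$$
for all $x$ and $y$.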
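As the number of independent random events grows, the joint probability of all of them occurring decreases rapidly to zero, according to a negative exponential law. For example, $n$ independent events that each have probability 1/2 all occur with probability $2^{-n}$.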
Similarly, two absolutely continuous random variables are independent if
$$f_{X,Y}(x, y) = f_X(x) \cdot f_Y(y)$$
for all $x$ and $y$. This means that acquiring any information about the value of one or more of the random variables leads to a conditional distribution of any other variable that is identical to its unconditional (marginal) distribution; thus no variable provides any information about any other variable.
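The product condition for discrete variables can be tested directly on a joint table. The sketch below applies it to the coin-flip and die-roll examples from earlier in the article; the function name and dictionary layout are illustrative.

```python
from fractions import Fraction

def is_independent(joint):
    """True if every cell of the joint PMF equals the product of its two marginals."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in xs for y in ys)

# Two fair coin flips: every pair of outcomes has probability 1/4.
coins = {(a, b): Fraction(1, 4) for a in (0, 1) for b in (0, 1)}

# Die roll: A indicates an even face, B indicates a prime face.
die = {
    (0, 0): Fraction(1, 6),
    (1, 0): Fraction(2, 6),
    (0, 1): Fraction(2, 6),
    (1, 1): Fraction(1, 6),
}

print(is_independent(coins))
print(is_independent(die))
```

The coin flips satisfy the condition, while the die indicators do not: both marginals equal 1/2, yet the probability that the face is both even and prime is 1/6 rather than 1/4.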
Joint distribution for conditionally dependent variables
If a subset $A$ of the variables $X_1, \cdots, X_n$ is conditionally dependent given another subset $B$ of these variables, then the probability mass function of the joint distribution, $\mathrm{P}(X_1, \ldots, X_n)$, is equal to $P(B) \cdot P(A \mid B)$. Therefore, it can be efficiently represented by the lower-dimensional probability distributions $P(B)$ and $P(A \mid B)$. Such conditional independence relations can be represented with a Bayesian network or copula functions.
Important named distributions
Named joint distributions that arise frequently in statistics include the multivariate normal distribution, the multivariate stable distribution, the multinomial distribution, the negative multinomial distribution, the multivariate hypergeometric distribution, and the elliptical distribution.
See also
 Bayesian programming
 Chow–Liu tree
 Conditional probability
 Copula (probability theory)
 Disintegration theorem
 Multivariate statistics
 Statistical interference
References
 [1] Park, Kun Il (2018). Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer. ISBN 9783319680743.
External links
 Hazewinkel, Michiel, ed. (2001) [1994], "Joint distribution", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 9781556080104
 Hazewinkel, Michiel, ed. (2001) [1994], "Multidimensional distribution", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 9781556080104
 "Joint continuous density function". PlanetMath.
 Mathworld: Joint Distribution Function