
In probability and statistics, a multivariate random variable or random vector is a list of mathematical variables each of whose values is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system; often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of an unspecified person from within a group would be a random vector. Normally each element of a random vector is a real number.
Random vectors are often used as the underlying implementation of various types of aggregate random variables, e.g. a random matrix, random tree, random sequence, stochastic process, etc.
More formally, a multivariate random variable is a column vector (or its transpose, which is a row vector) whose components are scalar-valued random variables on the same probability space as each other, $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the sample space, $\mathcal{F}$ is the sigma-algebra (the collection of all events), and $P$ is the probability measure (a function returning each event's probability).
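As an illustration of the age, height and weight example, here is a minimal sketch that draws realizations of a three-component random vector. The function name `sample_person`, the distributions and all numeric parameters are made up for illustration and are not taken from the article; the only point is that the components are drawn together, on the same underlying randomness, and need not be independent.

```python
import random

def sample_person(rng):
    """Draw one realization of a hypothetical random vector (age, height_cm, weight_kg).

    All distributions and parameters here are illustrative assumptions.
    """
    age = rng.uniform(18, 80)          # years
    height_cm = rng.gauss(170, 10)     # centimetres
    # Weight is built from height plus noise, so the components are dependent.
    weight_kg = 0.9 * (height_cm - 100) + rng.gauss(0, 8)
    return (age, height_cm, weight_kg)

rng = random.Random(0)                 # fixed seed for reproducibility
samples = [sample_person(rng) for _ in range(5)]
for s in samples:
    print(tuple(round(c, 1) for c in s))
```

Each printed tuple is one observed value of the random vector; the vector itself is the rule that produces such tuples.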
YouTube Encyclopedic


✪ Lecture 19: Joint, Conditional, and Marginal Distributions  Statistics 110

✪ (PP 5.1) Multiple discrete random variables

✪ Mod01 Lec10 Multivariate normal distribution

✪ Lecture 30: Chi-Square, Student-t, Multivariate Normal  Statistics 110

✪ Multivariate Transformations Part 1
Transcription
Okay? So last time we were talking about joint distributions. And just to kinda quickly remind everyone, the big theme right now is joint, conditional, and marginal distributions. And everyone needs to get comfortable with how all those concepts relate. So there's three different types of things. Joint, conditional, and marginal. And we were talking about joint and marginal distributions last time. Not so much about conditional distributions. But it's analogous to the stuff we've already seen about conditioning. So those are the three key words. Joint, conditional, and marginal distributions. So at this point in the course we pretty much have all the tools we need for working with one random variable at a time. But there's much much more that we need to study about what happens when we have two random variables. Or a list, a sequence of random variables. Things like that. A sum of a million random variables, and things like that. So we're gonna talk a lot about what happens with lots of random variables at the same time. And that's why I keep emphasizing that everything is cumulative here. Because if you have trouble with one random variable and its CDF then understanding two of them at the same time is gonna be very difficult. So we always have a joint CDF. If there's two of them. I'll just write down what it looks like, F(x,y) = P(X ≤ x, Y ≤ y). So joint CDF would be this, for two random variables. But of course, if we had a million random variables instead of two, I'm not gonna write this down. I could write X1 through X a million. And then, X1 less than or equal to little x1, and so on. And a million of them. So this extends to as many as you want. But it's just easier to write it down and think about it for two of them. But it's more general than this. That's the joint CDF that always makes sense. They can be discrete, continuous, mixtures of discrete and continuous or anything. In the continuous case then we have a joint PDF which I talked a little bit about.
But I don't think I wrote down how to get from the joint CDF to the joint PDF. So then we have a joint PDF. And it's analogous to the one-dimensional case. Where in the one-dimensional case we take the derivative of the CDF to get the PDF. In this case, we take the derivative except that it's a function of two variables. So, we're gonna take two partial derivatives. And so I would write it as d squared over dx dy of F(x,y). Which looks complicated, especially if you haven't seen partial derivatives. But even if you haven't ever done partial derivatives before there's nothing really to worry about with this. All it means is take the derivative, this is a function of two variables. Take the derivative with respect to y, treating x as a constant, right? So if you can do derivatives, which I'm assuming you can do, you can pretend x is a constant. And then take the derivative with respect to x, holding y as a constant. And there's a theorem in multivariable calculus that says that under some mild conditions, it doesn't actually matter if you take the partial with respect to y then with respect to x. Or with respect to x and then with respect to y, you'll get the same thing. So this is again analogous to the one-dimensional case. And the joint PDF, this is not a probability, that's a density. That's what we integrate to get a probability. So integrate this. If we want to know what's the probability that (X, Y) is in some set A? Then that's just gonna be the integral over that set A of the density. If you haven't done double integrals before, again, it's no big deal. Just integrate with respect to x, holding y constant, and then integrate it with respect to y. Basically, the only complicated thing is figuring out the limits of integration. So I just wrote double integral over A, cuz A could be any region in the plane. So if we had something like, if A is this blob, that may be a hard problem to do this integral. What does it mean to integrate over the blob?
I mean, that turns into a nasty multivariable calculus problem. That's not something we care about for this course. It's just a nasty calculus problem. It's not an interesting probability problem. So the more interesting case for our purposes would be if it's, let's call that A1. Down here's the A we actually want. If it's a rectangle, then this double integral just means integrate x goes from here to here, y goes from here to here. So it's just literally the integral of the integral so it's no different from doing two integrals. So you don't have to worry too much about the blobs. There's only one case where we might care about the blobs in this course. And that's when we have a uniform distribution over some region. So I'll come back to this. So at the very end last time we were talking about a distribution that's uniform over a square or over a circle, that kinda thing. And in the uniform case, we can interpret probability as proportional to area. So in the uniform case probability is proportional to area. And then I could say well, I'm just going to do something proportional to the area of the blob. And at least I can think more geometrically. But anyway, conceptually it's analogous. The joint PDF is what we integrate to get the probability of (X, Y) being in any particular set, right? In one dimension we'd say, what's the probability that X is between 3 and 5, right? We want an interval. And here we want the probability that it's in some region. But the rectangular case is gonna be the nicest one. So, those are joint distributions. And I talked a little bit last time about how to get the marginals. And it's very straightforward. Marginal PDF of X. To get the marginal PDF of X, we just integrate out the y. So we just integrate minus infinity to infinity, f(x, y) dy. Notice that by doing this, we'll get something that's now a function of x.
x is just treated as a constant here. We are integrating over all y, so this is no longer gonna depend on y, because when you're integrating over all y, y becomes a dummy variable. Similarly you get the marginal PDF of Y by integrating dx. And this is just completely analogous to summing over the cases. It's just saying we want X to be this little x. And Y has to be something. So we just integrate over all possibilities. So that's the marginal. So that's called marginalization. We marginalized out the y. Then we get the marginal of X. We integrate out the y, we get the marginal of X. It's just terminology for something very simple. Just integrate. If we did a double integral. So if we then took this thing and integrated this dx, we should get 1. And what that says, one way to think of it is to say if we let A be the entire plane, everything, we'd better get 1, right? Otherwise it wouldn't make any sense. The other way to think of it is this is supposed to be the density of X, just viewed as X on its own. So if we integrate this dx, we have to get 1, otherwise we did not find a valid marginal PDF. So that has to integrate to 1. May as well write that down just for emphasis, the double integral equals 1. And it's always minus infinity to infinity, minus infinity to infinity to start with. It might be that this is zero outside of some region, and then we could restrict it further. But we could always write it like this at first, and then we should be careful about where is it zero or where is it nonzero. So that's gonna be our marginal, let's do a conditional. Conditional distribution, so we want conditional PDF. And this should be easy to understand and remember, because it's analogous to conditioning we've done before. So let's say we want the conditional PDF of Y given X, well, we would just write that as f of y given x. Sometimes we'd put a subscript of Y given X just for emphasis. And sometimes we may leave out the subscript, just cuz it's clear from the context.
The conditional PDF, it's just, think of it as the PDF where we get to pretend that we know what X is. We get to observe what X is, okay? Given that information, that we now know the value of X, what is the appropriate PDF for Y? Well, we could think of that as being the joint density, divided by the marginal density of X. What I just wrote down just looks like the definition of conditional probability, right? The probability of this given this is the probability of this and this, divided by the probability of this thing. Now x and y are representing numbers, not events, okay? But it looks the same as the definition of conditional probability, and you can derive this from the definition of conditional probability. Where basically what you would do is say, our event is either, take Y = y, or if we are worried about probability zero, say Y is extremely close to y. That is, we let capital Y be in some tiny little interval around little y and find the conditional probability of that, given the value of X. And it's completely analogous to conditional probability. So this says that we can get the conditional just by doing the joint distribution, joint density divided by the marginal density. We could also do something that looks like Bayes' rule. That is, what if we want the conditional PDF of Y given X? Well, we want fX|Y(x|y) fY(y), divided by fX(x). I'm just writing down something that looks like Bayes' rule, right? I swapped the x and the y, but instead of probability I'm doing density, completely analogous to Bayes' rule. The proof is really, use Bayes' rule and then take a limit, and so this should be easy to remember. And the numerator is the same, another way to say this is that to get the joint density, we can take one of the marginals, then times the other conditional, right? That's like, if we're pretending it's probability rather than density. It's like the probability of this y value times the probability of the x value, given that y value.
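For reference, the relations described in words so far can be written out in symbols. This is an editorial summary for the continuous case; the subscripted notation (for example $f_{Y\mid X}$) is an assumed convention, and the conditional densities are only defined where the marginal in the denominator is positive.

```latex
\begin{align*}
  % joint PDF from the joint CDF
  f_{X,Y}(x,y) &= \frac{\partial^2}{\partial x\,\partial y}\, F_{X,Y}(x,y) \\
  % probability of a region A
  P\big((X,Y)\in A\big) &= \iint_A f_{X,Y}(x,y)\,dx\,dy \\
  % marginal PDF of X: integrate out y
  f_X(x) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy \\
  % conditional PDF of Y given X = x
  f_{Y\mid X}(y\mid x) &= \frac{f_{X,Y}(x,y)}{f_X(x)} \\
  % density analogue of Bayes' rule
  f_{Y\mid X}(y\mid x) &= \frac{f_{X\mid Y}(x\mid y)\,f_Y(y)}{f_X(x)} \\
  % joint density as marginal times conditional
  f_{X,Y}(x,y) &= f_{X\mid Y}(x\mid y)\,f_Y(y) = f_{Y\mid X}(y\mid x)\,f_X(x)
\end{align*}
```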
So everything is analogous to Bayes' rule in the discrete case. All right, so those are just the basic concepts we need for that. And I should mention again how to think of independence, so again, this is the continuous case. X and Y are independent if, well, the general definition is in terms of CDFs, but it's usually easier to work with the PDF than the CDF. So usually, the best way to think of it is independent means that the joint PDF is the product of the marginal PDFs. And that has to hold for all x and y. It's not too hard to show that that's equivalent to having the CDFs factor. Cuz basically, if the CDFs factor, you could take the derivative, this derivative thing, and you'll get this. You could take this thing, and integrate, and go back there, so it's basically equivalent. Intuitively, it should be equivalent. All right, so let's come back to this uniform example. Because I wanted to write what the conditional, we wrote down the joint PDF last time, I'll remind you. That is, we have the distribution that was uniform on a circle, or inside the disc. So uniform in the disc, which is x squared + y squared less than or equal to 1. We are picking a uniformly random point, maybe there. Uniform means that probability of some region is proportional to area, okay? So therefore, so one nice thing when we have problems that involve a uniform distribution on some region in the plane. We can actually think of probability in terms of area, or at least it's proportional to area. So the joint PDF we did last time is just 1 over pi, it's one over the area, because that'll make it integrate to 1. Within the circle, x squared + y squared less than or equal to 1, and 0 outside, okay? But just for practice, let's get the marginal density of X and then the conditional density, X given Y or Y given X. So, and by the way, this may look like they're independent because this doesn't, this looks like somehow it factors as a constant times a constant. But X and Y are not independent here, right?
Because if x is very close to 1, then it's constraining the values of y, so they're definitely dependent. You have to be careful about things like that. Cuz if you just only look at the 1 over pi, it looks like they might be independent. But the key thing is that they are constrained together, right? So this is saying that x and y are actually closely related. But if you only look at this part and ignore this part, you might think they're independent. All right, so let's get the marginal, fX(x), all we have to do is integrate out the y. So we're gonna integrate the joint PDF, which is 1 over pi, as long as we're careful to, we're gonna integrate this thing, dy. The only thing we have to be careful about is the limits of integration. This is only valid when x squared + y squared is less than or equal to 1. Which is the same thing as saying that y squared is less than or equal to 1 minus x squared. And that tells us that y has to be between minus square root of this and plus square root of this. So we're gonna integrate from minus square root of 1 minus x squared, to square root of 1 minus x squared. So the main mistake with this kind of problem is messing up the limits of integration somehow. We have to be very, very careful with limits of integration. You're not actually ever gonna have to do any difficult integral in this course. But sometimes, you have to think carefully about the limits of integration, okay? So this is just saying, these are the bounds on y for which I should have 1 over pi rather than 0 here, okay? So if we get the limits of integration wrong, then it's just completely wrong. All right, this is a very easy integral. This integral of a constant is just the constant times the length of the interval. So that's just 2/pi times the square root of 1 minus x squared. And that's valid for minus 1 less than or equal to x less than or equal to 1. As a check, we could integrate this thing, dx and, how do you actually integrate the square root of one minus x squared? You would do a trig substitution.
I'm not gonna do that integral right now, but you could integrate this thing from minus one to one, use a trig substitution as he just suggested. That's basically gonna reduce it back down to the fact that it's based on a circle, and you'll get one. So that does integrate to one. So that's the marginal, notice that this does not look like a uniform, so it's certainly false to say that it's uniform between minus one and one. The point (X, Y) is uniform, but the marginals are not uniform, right, and in fact you can see that this is largest when x is 0 which kinda makes sense. Cuz if you imagine the random point here, then kind of near the center seems like there's more space for stuff to happen and seems a little less likely to be further out, okay? So let's get the conditional PDF now. All right so we can either do Y given X or X given Y, whichever we feel like. Notice that if you want the marginal PDF of Y, just change the letter x to y here, by symmetry, no need to repeat the same calculation. Okay, so let's do the PDF of Y given X. So that's just gonna be the joint PDF divided by the marginal PDF of X. So it's just gonna be 1/pi divided by 2/pi times the square root of 1 minus x squared. I just took the joint PDF divided by the marginal PDF, and we have to be careful about where is this nonzero. I'm thinking of x as fixed right now, it's like we get to observe x and I wanna say well, what are the possible values of y? Well, for each x, we know that y has to be between, the same thing again, y has to be between minus root 1 minus x squared and root 1 minus x squared, okay, and 0 otherwise. So the pi's cancel, and, that looks kind of ugly. What would be another way to say what this conditional density is? You're treating x as a constant. What would you call this thing? Uniform, because notice this only has an x here, there's no y. In general you would have a y here. There's no letter y on the right side of this equation.
So a nicer way to write this would be to say that Y given X is uniform between (minus root 1 minus X squared, root 1 minus X squared) because this is just a constant for each fixed x. I wrote this with capital X here to clarify this notation. When you see this thing like Y given capital X. What does that mean? Intuitively that means just pretend that capital X, we know X is a random variable but pretend capital X is a known constant cuz we got to observe it. But you can just think of this as shorthand for saying, Y given X = x. This kind of is a more direct way to write it, that is, we get to observe that X = x. And we're saying that if we know that then we have a uniform distribution, between (minus square root of 1 minus x squared, square root of 1 minus x squared). But it's a little more cumbersome to write it this way. So sometimes I'll write it this way with capital X but just treat that as shorthand for this. That just means given that we get to know what X is, here's the distribution for Y. So we're treating X as a constant here, and here we're explicitly calling that constant little x, it's just notation. So okay, so that says it's conditionally uniform over some interval. Notice that that's the appropriate interval. Cuz as soon as you specify what x is, we know y has to be between here and here. This says it's uniform. So similarly you could do f of x given y. And you can see that they're not independent, because well one way to see it is, the joint PDF f(x, y) does not equal the product of the marginal PDFs in general here, right? Take this thing and then the same thing with y, you multiply them, you do not get the joint PDF. So they're not independent. Another way to say they're not independent is that the conditional distribution of Y given X is not the same thing as the unconditional distribution of Y. That is, learning X gives us information, okay? All right, so those are these basic concepts, joint, conditional, marginal. I wanted to mention one more thing that's analogous to the one-dimensional case.
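A brief aside on the disc example before that next topic: the marginal and conditional results can be checked numerically. The sketch below is an editorial addition, not part of the lecture. It assumes rejection sampling from the enclosing square to get uniform points in the disc, and a midpoint rule for the integrals; the helper names `sample_disk` and `marginal_pdf_x`, the seed and the sample size are arbitrary choices.

```python
import math
import random

def sample_disk(rng):
    """Uniform point in the unit disc via rejection sampling from [-1, 1]^2."""
    while True:
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1:
            return x, y

def marginal_pdf_x(x):
    """Marginal PDF of X derived in the lecture: (2/pi) * sqrt(1 - x^2) on [-1, 1]."""
    return (2 / math.pi) * math.sqrt(1 - x * x) if -1 <= x <= 1 else 0.0

def midpoint_integral(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return sum(f(a + (k + 0.5) * h) for k in range(steps)) * h

rng = random.Random(0)
points = [sample_disk(rng) for _ in range(200_000)]

# 1) The marginal should integrate to 1 over [-1, 1].
total = midpoint_integral(marginal_pdf_x, -1, 1)

# 2) P(|X| <= 0.5): marginal integral vs. fraction of simulated points.
exact = midpoint_integral(marginal_pdf_x, -0.5, 0.5)
empirical = sum(1 for x, _ in points if abs(x) <= 0.5) / len(points)

# 3) Given X near 0.8, Y is confined to a short interval around 0,
#    which is the dependence the lecture points out.
ys_near = [y for x, y in points if abs(x - 0.8) < 0.01]
bound = math.sqrt(1 - 0.79 ** 2)

print(total, exact, empirical, max(abs(y) for y in ys_near), bound)
```

The third check illustrates why X and Y are dependent even though the joint density is constant on the disc: once x is observed, the range of y shrinks to plus or minus the square root of 1 minus x squared.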
And that's what I call the 2D LOTUS. And it's completely analogous. So we wanna do LOTUS where we have a function of more than one variable. So, let's let (X,Y) have a joint PDF. I'll state it in the continuous case, but you could also do a discrete 2D LOTUS if you want. So we have a joint PDF, f(x,y), okay, and then just let g be any function of x, y. Let's say it's real valued. So this function g takes two values as input and outputs one value. For example it could just be x plus y, or it could be x squared times sine of xy cubed or whatever. Just any function of x, y, okay? A real-valued function of x, y. And then we're gonna write down LOTUS. LOTUS tells us how to get the expected value of g(X, Y), and it says, we do not need to try to find the PDF of g(X, Y), we can work directly in terms of the joint PDF. And all we have to do is integrate. It's gonna be minus infinity to infinity, minus infinity to infinity, but possibly we can narrow down that range, of g(x, y), I just change capital X, Y to lowercase x, y. And then I use the joint PDF, f(x, y) dx dy. Completely analogous. So let's do a couple of examples, how is this fact useful? So here is an important fact, I already needed this fact once and we didn't prove it yet, which was like we were talking about the fact that the MGF of a sum of independent random variables is the product of the MGFs. And at some point we need to say E of something times something is E of something, E of the other thing. That's true when they're independent, that's what we need to show right now. So the theorem is that if X and Y are independent, then E of XY equals E of X E of Y, that's a very useful fact. Well, we'll come back to this fact later when we talk about correlation, the way we would say this in words is that independent implies uncorrelated, that's just foreshadowing. Later we'll talk more about what exactly does correlation mean.
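In symbols, the 2D LOTUS and the theorem just stated read as follows. This is an editorial restatement for the continuous case, with $f_{X,Y}$ as assumed notation for the joint PDF and $f_X$, $f_Y$ for the marginals.

```latex
\begin{align*}
  % 2D LOTUS: no need to find the distribution of g(X, Y)
  E\big[g(X,Y)\big]
    &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x,y)\, f_{X,Y}(x,y)\,dx\,dy \\
  % theorem: independence implies the expectation of the product factors
  X,\,Y \text{ independent}
    \;&\Longrightarrow\; E(XY) = E(X)\,E(Y)
\end{align*}
```

The theorem is the special case $g(x,y) = xy$ with $f_{X,Y}(x,y) = f_X(x)\,f_Y(y)$.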
But, when we define correlation, in a later lecture, we'll see that that actually says that they're uncorrelated. And so this, independent implies uncorrelated, is the way to say it in words. So let's prove this fact. And this is always true. It doesn't matter if they're continuous or discrete or whatever. But so we don't have to invent a lot of notations or do a lot of cases, let's just do a continuous case for practice. So, proof in the continuous case. Well, we're just gonna use the 2D LOTUS. That saves us a lot of effort, because when you just see this thing E of XY, X times Y is a random variable in its own right. So the first time you see this, you might think I need to study that random variable. That takes a lot of work. 2D LOTUS just says that's a function of X and Y, I'm just gonna use LOTUS and then it's gonna be easy. So E of XY equals, how do we do this? Well, I'll just write down double integral minus infinity to infinity, minus infinity to infinity, xy times the joint PDF. But since we assumed that they're independent, the joint PDF is just the product of the marginal PDFs. So independence means the joint PDF just factors like that. And that's what makes this, actually, easy to deal with. Because this function is just separated out like, this is a function of x, function of y. Function of x, function of y. Very nice. So, now, what do we actually do? Well, what this says to do is take this. I'll put parentheses here to make it a little clearer what this double integral means. That's just the definition of this double integral. It says do this integral, then do this outer integral, so you work your way out. When you're doing this inner integral you're treating y as a constant, so this y, you're gonna stick it right there. And this fY of y, also stick that there. Both of those come out. So that looks a little messy, let's rewrite that. All I did was to take out the y and the marginal PDF of Y. And what's left is the x and the marginal PDF of X.
So I just took them out. Now this whole thing here, that's just a number. That just says we took this function and we integrated it over x and we get a number. That's just a constant. And we know what constant that is, that's E of X. That's just a number. So that constant you can pull out of this entire integral. It's just a constant, take it out of the integral. What's left? Integral of y times the PDF of Y. That's just E of Y. So that's immediately just E of X E of Y. So basically this amounts to E of X E of Y. All this amounts to doing is just taking out things that you're treating as constant and then it factors. So that's a useful fact. And it would be a nightmare to try to prove this without having LOTUS available. But with LOTUS then we can do that pretty quickly. All right, there's another problem I like to do with the 2D LOTUS. And that's like expected distance between two points. So let's start with the uniform case. I talked about this on the strategic practice too, you can look at that later. But I think this is a useful point for everyone to see this now. So we have, so this is an example, where we take two uniforms, let's let them be X and Y, i.i.d. Uniform(0, 1). And we wanna find the expected distance between them. So, this kind of problem comes up a lot in applications where you have two random points. And often you wanna know how far apart they are. So, this is used for various applications. And so one approach would be, try to study absolute value of X minus Y, find its distribution, and maybe for some problems we need the distribution. But in this case, they said I only want the mean, so, therefore LOTUS should suffice for that. So just write down LOTUS. So it's a double integral of absolute value of x minus y. And since they're i.i.d. uniform, the PDF is just 1. The joint PDF is just 1. So you just have to integrate this thing, dxdy from 0 to 1, 0 to 1. So, then the only question I guess is how do we integrate the absolute value?
Well usually if you wanna integrate an absolute value, the best strategy would be to split the integral into pieces such that you can get rid of the absolute value. So we could split this up as one piece where x is greater than y. I'll write it this way. x greater than y. That is, I'm integrating this function over the set of all points with x and y in 0 to 1 where x is greater than y. Now if x is greater than y, I can just drop the absolute value. Plus, and now integrate over the piece where x is less than or equal to y. And in that case, it's y minus x, not x minus y. Now if you think of the symmetry of the problem, this problem is completely symmetrical because of the i.i.d. And this is a symmetrical function, I could have changed this to y minus x. So, really there is no point in doing two double integrals, let's just do one double integral and double it. And then we have to do two integrals instead of four, so that's much nicer. This is just gonna be 2 times the first integral I wrote down. Okay, I am not gonna do a lot of double integrals in class and you won't have to do many double integrals in general in this course. We will have to do a couple of them. So just for practice, let's do this. Basically, the only thing you could mess up is the limits of integration. So let's carefully say, how do we get the correct limits of integration here? The outer limits, I could've done dydx and then it'll be different limits of integration. Okay, but I chose, for no particular reason, to just write it as dxdy. So if we write dxdy, the outer limits must refer to y. And we know y goes from 0 to 1. Okay, now the inner limits, so these outer limits have to just be numbers. But as you move inward, the limits can start depending on other variables, so these inner limits can depend on y. In fact, they have to depend on y. Okay, it would not work to go 0 to 1 here. We know x has to be between 0 and 1.
But we also know that we're only integrating over x greater than y, so x has to be greater than y. So we go from y to 1. Right, because x is bigger than y so it has to start at y, so that's all we have to do. We do have to be careful. It's easy to mess up the limits of integration. Now this just says do 2 easy integrals, okay? So it's integral 0 to 1, this inner integral, I integrate x minus y dx, so I'm treating y as a constant, so I would just do x squared over 2 minus yx. All right, I'm treating y as a constant and then evaluate this from y to 1. Okay, and so then we just plug in 1 and subtract, plug in y. And then it's just a very, very easy integral. And I won't bore you with all the algebra for that. You just plug in 1, plug in y, and it's integrating a very easy integral. If you simplify that you get one-third. So the average distance between two uniforms is one-third. Let's draw a little picture to see whether that makes intuitive sense to us. So we have this interval 0 to 1, okay? And we're picking 2 uniformly random points in this interval. Let's say there and there, completely random. But notice that the distance between them is one-third because that's one-third, two-thirds, the distance is one-third. That sort of looks like your stereotypical, if you had to guess something what would it look like, that might be what you would guess, right, and it works. This actually, for me at least, this actually makes the result one-third easy to remember, even though that's not a proof, obviously. >> [LAUGH] >> But that actually does suggest another way to look at this problem. Which is, I'm picking these two random points, and there's gonna be a point on the left and a point on the right. So that suggests reinterpreting this in terms of the max and the min. So another way to look at this would be to let, let's say M = maximum of X, Y. And you should think through for yourself, why is the maximum of random variables a random variable?
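The algebra skipped above for the expected distance can be filled in as follows. This is an editorial worked version of the integral described in the lecture, using the same order of integration (dx inside, dy outside) and the same symmetry argument.

```latex
\begin{align*}
  E|X-Y|
    &= \int_0^1\!\!\int_0^1 |x-y|\,dx\,dy
     = 2\int_0^1\!\!\int_y^1 (x-y)\,dx\,dy \\
    &= 2\int_0^1 \left[\frac{x^2}{2} - yx\right]_{x=y}^{x=1} dy
     = 2\int_0^1 \left(\frac{1}{2} - y + \frac{y^2}{2}\right) dy \\
    &= \int_0^1 (1-y)^2\,dy
     = \frac{1}{3}
\end{align*}
```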
That's just basic practice with your random variables. L, this is something that always annoys me is that the word maximum and the word minimum both start with m, so it's hard to remember your notation. So I started using L for the minimum because L stands for the least one or L stands for the little one, but unfortunately, then I realized L could also stand for the large one. >> [LAUGH] >> It's just one annoying fact about English. >> [LAUGH] >> Well, anyway, we'll let M be the max and L be the min. Here's a handy fact, absolute value of X minus Y is the same thing as M minus L, right. Because you take the bigger one minus the smaller one, that's the same thing as taking the absolute difference, same thing, right. That's how you do an absolute value, you just take the bigger one minus the smaller one. So therefore, what we've just shown is that E of M minus L = one-third, according to that calculation. So that says that E of M minus E of L = one-third. And on the other hand, sorry, I should have written this up higher. I'm gonna go loop around to the top here. So the difference of the expectations is one-third. Let's also look at the sum. If we look at E of M + L, well, by linearity, that's E of M + E of L. But on the other hand, what's M + L in terms of X and Y? It's just X + Y because if you add the bigger number plus the smaller number, all you've done is add the two numbers, right? M + L is the same thing as X + Y. But by linearity, E of X + Y is E of X + E of Y and both of those are one-half cuz they're uniform 0 to 1, so this must = 1. We just showed that that is = 1. So from this, we actually now have an expression for the sum of these two expectations and the difference. So therefore, we can just solve that and we get E of M and E of L. So, E of M = two-thirds, and E of L, I have a system of two equations and two unknowns, just solve that the usual way, add the two equations, that kind of thing. E of L = one-third, just like in this picture. That's L, that's M, on average.
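The three values above (one-third for the expected distance, two-thirds for the expected max, one-third for the expected min) can be sanity-checked by simulation. This is an editorial sketch, not part of the lecture; the seed and the sample size are arbitrary, and a Monte Carlo estimate only approximates the exact values.

```python
import random

rng = random.Random(0)       # fixed seed so the run is reproducible
n = 200_000                  # number of simulated (X, Y) pairs

sum_dist = sum_max = sum_min = 0.0
for _ in range(n):
    x, y = rng.random(), rng.random()   # X, Y i.i.d. Uniform(0, 1)
    sum_dist += abs(x - y)              # |X - Y|
    sum_max += max(x, y)                # M
    sum_min += min(x, y)                # L

mean_dist = sum_dist / n
mean_max = sum_max / n
mean_min = sum_min / n

print("E|X-Y| estimate:", mean_dist)
print("E(M) estimate:  ", mean_max)
print("E(L) estimate:  ", mean_min)
```

Because the absolute difference equals M minus L for every simulated pair, the estimate of the expected distance agrees with the difference of the other two estimates up to floating-point rounding, which mirrors the linearity argument in the lecture.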
So on average, it looks like that. So another approach to this problem would have been, I used this result to prove this result, another approach would have been, let's directly study the max and the min, okay. And you've seen examples like on the strategic practice, like very useful facts, that the minimum of independent exponentials is exponential with a larger rate. And we showed that on the strategic practice problem. And on the new strategic practice, there's something related with the min and the max. So another way to do this would have been directly find the PDF of M, the PDF of L, and that would give us this result, right. So you could go in either direction. I actually don't like doing double integrals, I'm not gonna do a lot of double integrals, you won't have to do many. I felt I should do one for practice with the 2D LOTUS. Okay, but in general, I would rather think more in this way. Use linearity, use the CDFs, things like that, and not do a lot of integrals. Okay, so those are continuous examples. I want to do one discrete example for the rest of today. And then next time, we'll also do some more discrete stuff and maybe some more continuous stuff too. This is one of my favorite discrete problems. I call it the chicken and egg problem. Chicken-egg. We already had a homework problem about chickens and eggs, and hatchings and so on. But it's not exactly the same as this problem, it's related. So here's the problem. I'll state the problem, then we'll solve the problem, and then we'll be done. >> [LAUGH] >> Okay, here's the problem. There are some eggs, some of them hatch, some of them don't hatch. The eggs are independent. Let's assume there are N eggs. The twist to this problem is that the number of eggs is random, the chicken doesn't always lay exactly the same number of eggs. So let's assume that it's Poisson(lambda).
That's the number of eggs. Now each one either hatches or fails to hatch, so each hatches with probability p, and independently, so you can think of each egg as an independent Bernoulli p trial for whether it hatches or not. Hatching is success. So independently, and let X equal the number that hatch. So I would write that as X given N is Binomial(N, p). That's just a restatement, we already know this. As I explained, this notation means pretend that N is a known constant. Actually N is Poisson, but pretend that now we know the number of eggs. So we're treating N as a constant, then it's just Binomial(N, p) because I assumed independent Bernoulli trials. So still we know that. Okay, let's also let Y equal the number that don't hatch. So we have an identity, X plus Y equals N. All right, well, that's not the end yet, but we derived the theorem that X plus Y equals N, because the number that hatch plus the number that don't hatch equals the number of eggs. Now the problem is to find the joint PMF. It's discrete, so I could find the joint PMF of X and Y. And in particular we'd like to know, are they independent? And intuitively, they seem extremely dependent because their sum must equal N. That's not a proof, though, because this only shows that they're conditionally dependent. That is, if we know N they are dependent, and intuitively they seem pretty dependent. That is, if you have a lot of eggs that hatch then there's not so many left that don't hatch. But we haven't yet proven whether they are independent or not, because N here is random. So now let's find the joint PMF. So just by definition the joint PMF is the probability that X equals something, let's say i, and Y equals j. I could use little x, little y, but I'm just using i and j to remind us that they are integers. Now, to do that, somehow we have to bring in this Poisson thing. So our strategy for solving this should be going back to our early part of the course.
You have a probability, if you don't immediately know how to do it, try to find something to condition on. What do you condition on? What we wish that we knew. I wish I knew the number of eggs. Then it's an easy binomial problem. Conditional on the number of eggs, it's just a binomial. So we're gonna condition on N. The law of total probability says we can just write this as the sum of the probability that X equals i, Y equals j, given that N equals n, times the probability that N equals n, summed over all n from zero to infinity. That's just the law of total probability. And the probability that N equals n, we already know that from the Poisson. Well, okay, that looks a little scary, like we're gonna have to do an infinite sum. For similar problems I've seen a lot of students get stuck at this point. And my suggestion, if you ever find yourself getting stuck at a point like this, is to try some simple examples, make up some numbers, do some special cases so you think about it more concretely rather than being intimidated by this infinite series. If you actually think about it concretely, you'll notice something very, very simple. That is, if I said, what is the probability of this, this is just kind of some scratch work. What's the probability that X equals 3, Y equals 5, given N equals 10? What's that? 0, because there's 10 eggs, 3 hatched, 5 didn't hatch, someone stole the other two eggs, I mean it doesn't make any sense, it's impossible, 0. What's the probability that X equals 3, Y equals 5, given N equals 2? 0. There's only two eggs and yet you're claiming three hatched and five didn't hatch, that makes no sense. So as soon as you write down a few simple numbers like that, it becomes completely obvious that this infinite sum is actually only one term. The one term is the case when in fact N equals i plus j. So we only have one term here. X equals i, Y equals j, given N equals i plus j. Otherwise there's a mismatch. Times the probability that N equals i plus j.
And now we know everything we need to know to just evaluate this. Notice there's now some redundancy, because if I know there's i plus j eggs and i hatched, I already know that j didn't hatch. You didn't have to tell me that. Redundant information, we just cross that out. Now X equals i given N, that's just from the binomial. Cuz given the value of N we're treating X as binomial, so we're just gonna take something from the binomial PMF times something from the Poisson PMF. And so let's see, that board is broken, so we can do this here still, just have a little more space. So we want to find the probability that X equals i, given N equals i plus j, times the probability that N equals i plus j. Okay, this is an easy calculation now, but let's see what the answer is. For the first term we just use the binomial. So i plus j choose i. I'll write that as i plus j over i factorial, j factorial. That's just i plus j choose i. Then, that's a factorial, thank you. So that's i plus j factorial, thanks, that's i plus j choose i. From the binomial. Times p to the i, because we're assuming Binomial(N, p). So p to the i, and q, as usual, q is one minus p. So i successes, j failures, q to the j, and then times the Poisson PMF, e to the minus lambda, lambda to the i plus j over i plus j factorial. Let's just simplify this quickly. The i plus j factorials cancel, and let's try to write this in a nicer looking form, where we are going to try to split it up into a function of i times a function of j. So we could write this as, lambda to the i plus j we can split that up, so really we have lambda p to the i over i factorial. And we have lambda q to the j over j factorial. And the only thing left that we have to deal with is this e to the minus lambda. But remember that p plus q equals 1. So I can think of it as having a p plus q sitting up in the exponent. So this is e to the minus lambda p. And this is e to the minus lambda q.
So actually it factored, so actually that shows that they are independent. That says that X and Y are independent. And X is also Poisson: X is Poisson lambda p, and Y is Poisson lambda q. Which sounds impossible at first, how could they be independent? And if your intuition was that they're not independent, you shouldn't feel bad about that because it turns out that this is only true for the Poisson. So this is actually a very special property of the Poisson. If you change Poisson to anything else they will become dependent. It happens to be true for the Poisson, we just proved that they're independent. That is, you think, well, you have more eggs that hatched, there are fewer that didn't hatch, but the number of eggs is random, and that randomness, exactly for the Poisson, makes them independent. Well, that's just an example of a joint PMF. It's also a nice story. And have a good weekend.
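The factorization derived in the lecture can be verified term by term. The following is a small sketch, assuming N is Poisson with rate `lam` and each egg hatches independently with probability `p`; the numeric values and function names are illustrative and do not come from the lecture. It compares the joint PMF obtained by conditioning on N = i + j with the product of a Poisson(lam·p) PMF and a Poisson(lam·q) PMF.

```python
from math import comb, exp, factorial, isclose

def poisson_pmf(k, rate):
    """P(K = k) for K ~ Poisson(rate)."""
    return exp(-rate) * rate**k / factorial(k)

def joint_pmf(i, j, lam, p):
    """P(X = i, Y = j): condition on N = i + j, then Binomial times Poisson."""
    q = 1 - p
    binom = comb(i + j, i) * p**i * q**j
    return binom * poisson_pmf(i + j, lam)

lam, p = 4.0, 0.3  # illustrative values, not from the lecture
factorizes = all(
    isclose(joint_pmf(i, j, lam, p),
            poisson_pmf(i, lam * p) * poisson_pmf(j, lam * (1 - p)),
            rel_tol=1e-9)
    for i in range(15) for j in range(15)
)
print(factorizes)  # True
```

Agreement on a finite grid is only a sanity check; the algebra in the lecture is what establishes independence for all i and j.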
Probability distribution
Every random vector gives rise to a probability measure on $\mathbb{R}^n$ with the Borel algebra as the underlying sigma-algebra. This measure is also known as the joint probability distribution, the joint distribution, or the multivariate distribution of the random vector.
The distributions of each of the component random variables $X_i$ are called marginal distributions. The conditional probability distribution of $X_i$ given $X_j$ is the probability distribution of $X_i$ when $X_j$ is known to be a particular value.
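The cumulative distribution function $F_{\mathbf{X}} : \mathbb{R}^n \to [0,1]$ of a random vector $\mathbf{X} = (X_1, \dots, X_n)^T$ is defined as^{[1]}^{:p.15}

$$F_{\mathbf{X}}(\mathbf{x}) = \operatorname{P}(X_1 \leq x_1, \ldots, X_n \leq x_n) \qquad \text{(Eq.1)}$$

where $\mathbf{x} = (x_1, \dots, x_n)^T$.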
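As an illustration of the joint cumulative distribution function, the sketch below estimates P(X_1 ≤ x_1, ..., X_n ≤ x_n) from a finite sample by counting the observations that are less than or equal to the query point in every component. The sample values and the function name `empirical_joint_cdf` are hypothetical and serve only to make the definition concrete.

```python
def empirical_joint_cdf(samples, x):
    """Fraction of sample vectors that are <= x in every component."""
    hits = sum(all(s_k <= x_k for s_k, x_k in zip(s, x)) for s in samples)
    return hits / len(samples)

# Hypothetical observations of a 2-component random vector (X1, X2).
samples = [(0.2, 1.5), (0.7, 0.3), (1.1, 0.9), (0.4, 0.8)]

print(empirical_joint_cdf(samples, (0.5, 1.0)))  # 0.25: only (0.4, 0.8) qualifies
print(empirical_joint_cdf(samples, (2.0, 2.0)))  # 1.0: every sample qualifies
```

Note that the comparison is componentwise: a sample counts only if all of its coordinates are below the corresponding thresholds.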
Operations on random vectors
Random vectors can be subjected to the same kinds of algebraic operations as can nonrandom vectors: addition, subtraction, multiplication by a scalar, and the taking of inner products.
Affine transformations
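Similarly, a new random vector $\mathbf{Y}$ can be defined by applying an affine transformation to a random vector $\mathbf{X}$:

$$\mathbf{Y} = \mathbf{A}\mathbf{X} + b,$$

where $\mathbf{A}$ is an $n \times n$ matrix and $b$ is an $n \times 1$ column vector.
If $\mathbf{A}$ is an invertible matrix and $\mathbf{X}$ has a probability density function $f_{\mathbf{X}}$, then the probability density of $\mathbf{Y}$ is

$$f_{\mathbf{Y}}(y) = \frac{f_{\mathbf{X}}(\mathbf{A}^{-1}(y - b))}{|\det \mathbf{A}|}.$$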
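The change-of-density rule for an affine map, f_Y(y) = f_X(A^{-1}(y − b)) / |det A|, can be sketched for the 2×2 case. The example below assumes X consists of two independent standard normal components and uses a diagonal A, so the result can be compared against the product of two known normal densities; all names and numbers are illustrative assumptions.

```python
from math import exp, pi, sqrt, isclose

def std_normal_pdf_2d(x):
    """Density of two independent standard normals (the assumed f_X)."""
    return exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2 * pi)

def affine_pdf(f_x, A, b, y):
    """Density of Y = A X + b at y, for an invertible 2x2 matrix A."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("A must be invertible")
    d0, d1 = y[0] - b[0], y[1] - b[1]
    # x = A^{-1} (y - b), using the closed-form inverse of a 2x2 matrix
    x = ((a22 * d0 - a12 * d1) / det, (-a21 * d0 + a11 * d1) / det)
    return f_x(x) / abs(det)

def normal_pdf(t, mean, sd):
    """Univariate normal density, used only as an independent check."""
    return exp(-0.5 * ((t - mean) / sd) ** 2) / (sd * sqrt(2 * pi))

A, b = ((2.0, 0.0), (0.0, 3.0)), (1.0, -1.0)
y = (1.5, 0.5)
lhs = affine_pdf(std_normal_pdf_2d, A, b, y)
# With this diagonal A, Y1 ~ N(1, 2^2) and Y2 ~ N(-1, 3^2) independently.
rhs = normal_pdf(y[0], 1.0, 2.0) * normal_pdf(y[1], -1.0, 3.0)
print(isclose(lhs, rhs))  # True
```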
Invertible mappings
More generally we can study invertible mappings of random vectors.^{[2]}^{:p.290–291}
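Let $g$ be a one-to-one mapping from an open subset $\mathcal{D}$ of $\mathbb{R}^n$ onto a subset $\mathcal{R}$ of $\mathbb{R}^n$, let $g$ have continuous partial derivatives in $\mathcal{D}$ and let the Jacobian determinant of $g$ be zero at no point of $\mathcal{D}$. Assume that the real random vector $\mathbf{X}$ has a probability density function $f_{\mathbf{X}}(\mathbf{x})$ and satisfies $\operatorname{P}(\mathbf{X} \in \mathcal{D}) = 1$. Then the random vector $\mathbf{Y} = g(\mathbf{X})$ is of probability density

$$f_{\mathbf{Y}}(\mathbf{y}) = \left. \frac{f_{\mathbf{X}}(\mathbf{x})}{\left|\det \frac{\partial g(\mathbf{x})}{\partial \mathbf{x}}\right|} \right|_{\mathbf{x} = g^{-1}(\mathbf{y})} \mathbf{1}(\mathbf{y} \in R_{\mathbf{Y}})$$

where $\mathbf{1}$ denotes the indicator function and the set $R_{\mathbf{Y}} = \{\mathbf{y} = g(\mathbf{x}) : f_{\mathbf{X}}(\mathbf{x}) > 0\} \subseteq \mathcal{R}$ denotes the support of $\mathbf{Y}$.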
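A concrete instance of the change of variables for an invertible mapping is the componentwise exponential g(x) = (exp(x_1), exp(x_2)), whose Jacobian is diagonal. The sketch below assumes X has two independent standard normal components, so that each component of Y = g(X) should follow a standard log-normal distribution; the choice of g and of f_X is an assumption made for illustration.

```python
from math import exp, log, pi, sqrt, isclose

def f_x(x):
    """Assumed density of X: two independent standard normals."""
    return exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2 * pi)

def f_y(y):
    """Density of Y = g(X) with g(x) = (exp(x1), exp(x2))."""
    if y[0] <= 0 or y[1] <= 0:
        return 0.0                      # indicator: y lies outside the support
    x = (log(y[0]), log(y[1]))          # x = g^{-1}(y)
    jac_det = exp(x[0]) * exp(x[1])     # det of diag(exp(x1), exp(x2))
    return f_x(x) / abs(jac_det)

def lognormal_pdf(t):
    """Standard log-normal density, used only as an independent check."""
    return exp(-0.5 * log(t) ** 2) / (t * sqrt(2 * pi))

y = (0.8, 2.5)
print(isclose(f_y(y), lognormal_pdf(y[0]) * lognormal_pdf(y[1])))  # True
print(f_y((-1.0, 2.0)))  # 0.0
```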
Expected value
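The expected value or mean of a random vector $\mathbf{X}$ is a fixed vector $\operatorname{E}[\mathbf{X}]$ whose elements are the expected values of the respective random variables.^{[3]}^{:p.333}

$$\operatorname{E}[\mathbf{X}] = (\operatorname{E}[X_1], \ldots, \operatorname{E}[X_n])^T \qquad \text{(Eq.2)}$$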
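Covariance and cross-covariance
Definitions
The covariance matrix (also called second central moment or variance-covariance matrix) of an $n \times 1$ random vector is an $n \times n$ matrix whose (i,j)^{th} element is the covariance between the i^{th} and the j^{th} random variables. The covariance matrix is the expected value, element by element, of the $n \times n$ matrix computed as $[\mathbf{X} - \operatorname{E}[\mathbf{X}]][\mathbf{X} - \operatorname{E}[\mathbf{X}]]^T$, where the superscript T refers to the transpose of the indicated vector:^{[2]}^{:p. 464}^{[3]}^{:p.335}

$$\operatorname{K}_{\mathbf{X}\mathbf{X}} = \operatorname{Var}[\mathbf{X}] = \operatorname{E}[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{X} - \operatorname{E}[\mathbf{X}])^T] = \operatorname{E}[\mathbf{X}\mathbf{X}^T] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}]^T \qquad \text{(Eq.3)}$$

By extension, the cross-covariance matrix between two random vectors $\mathbf{X}$ and $\mathbf{Y}$ ($\mathbf{X}$ having $n$ elements and $\mathbf{Y}$ having $p$ elements) is the $n \times p$ matrix^{[3]}^{:p.336}

$$\operatorname{K}_{\mathbf{X}\mathbf{Y}} = \operatorname{Cov}[\mathbf{X}, \mathbf{Y}] = \operatorname{E}[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{Y} - \operatorname{E}[\mathbf{Y}])^T] = \operatorname{E}[\mathbf{X}\mathbf{Y}^T] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T \qquad \text{(Eq.4)}$$

where again the matrix expectation is taken element-by-element in the matrix. Here the (i,j)^{th} element is the covariance between the i^{th} element of $\mathbf{X}$ and the j^{th} element of $\mathbf{Y}$.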
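The two equivalent forms of the covariance and cross-covariance matrices, E[(X − E[X])(Y − E[Y])^T] and E[XY^T] − E[X]E[Y]^T, can be compared on a small discrete example. The sketch below assumes a pair of random vectors that takes four equally likely joint values; the data and the helper names are hypothetical.

```python
def mean_vec(outcomes):
    """Component-wise mean of equally likely outcome vectors."""
    n = len(outcomes)
    return [sum(o[k] for o in outcomes) / n for k in range(len(outcomes[0]))]

def cross_cov(xs, ys):
    """E[(X - E[X])(Y - E[Y])^T] for equally likely paired outcomes."""
    mx, my, n = mean_vec(xs), mean_vec(ys), len(xs)
    return [[sum((x[i] - mx[i]) * (y[j] - my[j]) for x, y in zip(xs, ys)) / n
             for j in range(len(my))] for i in range(len(mx))]

def cross_cov_alt(xs, ys):
    """The same matrix computed as E[X Y^T] - E[X] E[Y]^T."""
    mx, my, n = mean_vec(xs), mean_vec(ys), len(xs)
    return [[sum(x[i] * y[j] for x, y in zip(xs, ys)) / n - mx[i] * my[j]
             for j in range(len(my))] for i in range(len(mx))]

xs = [(1, 2), (3, 0), (5, 4), (7, 2)]   # values of X (n = 2 components)
ys = [(1,), (0,), (2,), (1,)]           # paired values of Y (p = 1 component)

K_xx = cross_cov(xs, xs)                # covariance matrix, n x n
K_xy = cross_cov(xs, ys)                # cross-covariance matrix, n x p
print(K_xx)  # [[5.0, 1.0], [1.0, 2.0]]
print(K_xy)  # [[0.5], [1.0]]
print(K_xx == cross_cov_alt(xs, xs), K_xy == cross_cov_alt(xs, ys))  # True True
```

The covariance matrix is obtained here as the cross-covariance of X with itself, which also makes its symmetry visible in the output.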
Properties
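The covariance matrix is a symmetric matrix, i.e.^{[2]}^{:p. 466}

$$\operatorname{K}_{\mathbf{X}\mathbf{X}}^T = \operatorname{K}_{\mathbf{X}\mathbf{X}}.$$

The covariance matrix is a positive semidefinite matrix, i.e.^{[2]}^{:p. 465}

$$\mathbf{a}^T \operatorname{K}_{\mathbf{X}\mathbf{X}} \mathbf{a} \geq 0 \quad \text{for all } \mathbf{a} \in \mathbb{R}^n.$$

The cross-covariance matrix $\operatorname{Cov}[\mathbf{Y}, \mathbf{X}]$ is simply the transpose of the matrix $\operatorname{Cov}[\mathbf{X}, \mathbf{Y}]$, i.e.

$$\operatorname{K}_{\mathbf{Y}\mathbf{X}} = \operatorname{K}_{\mathbf{X}\mathbf{Y}}^T.$$

Two random vectors $\mathbf{X} = (X_1, \ldots, X_m)^T$ and $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ are called uncorrelated if

$$\operatorname{E}[\mathbf{X}\mathbf{Y}^T] = \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T.$$

They are uncorrelated if and only if their cross-covariance matrix $\operatorname{K}_{\mathbf{X}\mathbf{Y}}$ is zero.^{[3]}^{:p.337}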
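Correlation and cross-correlation
Definitions
The correlation matrix (also called second moment) of an $n \times 1$ random vector is an $n \times n$ matrix whose (i,j)^{th} element is the correlation between the i^{th} and the j^{th} random variables. The correlation matrix is the expected value, element by element, of the $n \times n$ matrix computed as $\mathbf{X}\mathbf{X}^T$, where the superscript T refers to the transpose of the indicated vector:^{[4]}^{:p.190}^{[3]}^{:p.334}

$$\operatorname{R}_{\mathbf{X}\mathbf{X}} = \operatorname{E}[\mathbf{X}\mathbf{X}^T] \qquad \text{(Eq.5)}$$

By extension, the cross-correlation matrix between two random vectors $\mathbf{X}$ and $\mathbf{Y}$ ($\mathbf{X}$ having $n$ elements and $\mathbf{Y}$ having $p$ elements) is the $n \times p$ matrix

$$\operatorname{R}_{\mathbf{X}\mathbf{Y}} = \operatorname{E}[\mathbf{X}\mathbf{Y}^T] \qquad \text{(Eq.6)}$$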
Properties
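The correlation matrix is related to the covariance matrix by

$$\operatorname{R}_{\mathbf{X}\mathbf{X}} = \operatorname{K}_{\mathbf{X}\mathbf{X}} + \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{X}]^T.$$

Similarly for the cross-correlation matrix and the cross-covariance matrix:

$$\operatorname{R}_{\mathbf{X}\mathbf{Y}} = \operatorname{K}_{\mathbf{X}\mathbf{Y}} + \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T.$$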
Orthogonality
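Two random vectors of the same size $\mathbf{X} = (X_1, \ldots, X_n)^T$ and $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ are called orthogonal if

$$\operatorname{E}[\mathbf{X}^T\mathbf{Y}] = 0.$$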
Independence
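Two random vectors $\mathbf{X}$ and $\mathbf{Y}$ are called independent if for all $\mathbf{x}$ and $\mathbf{y}$

$$F_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = F_{\mathbf{X}}(\mathbf{x}) \cdot F_{\mathbf{Y}}(\mathbf{y})$$

where $F_{\mathbf{X}}(\mathbf{x})$ and $F_{\mathbf{Y}}(\mathbf{y})$ denote the cumulative distribution functions of $\mathbf{X}$ and $\mathbf{Y}$ and $F_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})$ denotes their joint cumulative distribution function. Independence of $\mathbf{X}$ and $\mathbf{Y}$ is often denoted by $\mathbf{X} \perp\!\!\!\perp \mathbf{Y}$. Written componentwise, $\mathbf{X} = (X_1, \ldots, X_m)^T$ and $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ are called independent if for all $x_1, \ldots, x_m, y_1, \ldots, y_n$

$$F_{X_1,\ldots,X_m,Y_1,\ldots,Y_n}(x_1, \ldots, x_m, y_1, \ldots, y_n) = F_{X_1,\ldots,X_m}(x_1, \ldots, x_m) \cdot F_{Y_1,\ldots,Y_n}(y_1, \ldots, y_n).$$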
Characteristic function
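The characteristic function of a random vector $\mathbf{X}$ with $n$ components is a function $\mathbb{R}^n \to \mathbb{C}$ that maps every vector $\mathbf{\omega} = (\omega_1, \ldots, \omega_n)^T$ to a complex number. It is defined by^{[2]}^{:p. 468}

$$\varphi_{\mathbf{X}}(\mathbf{\omega}) = \operatorname{E}\left[e^{i(\mathbf{\omega}^T\mathbf{X})}\right] = \operatorname{E}\left[e^{i(\omega_1 X_1 + \ldots + \omega_n X_n)}\right].$$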
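For a discrete random vector, the characteristic function E[exp(i ω^T X)] is a finite weighted sum and can be evaluated directly. The sketch below assumes two independent fair coin flips encoded as 0/1, a hypothetical example chosen because the joint characteristic function of independent components must equal the product of the marginal ones.

```python
import cmath

def char_fn(outcomes, probs, omega):
    """phi_X(omega) = E[exp(i * omega^T X)] for a discrete random vector."""
    return sum(
        p * cmath.exp(1j * sum(w * x for w, x in zip(omega, outcome)))
        for outcome, p in zip(outcomes, probs)
    )

# Two independent fair coin flips, encoded as 0/1 (a hypothetical example).
outcomes = [(0, 0), (0, 1), (1, 0), (1, 1)]
probs = [0.25] * 4

omega = (0.7, -1.3)
phi = char_fn(outcomes, probs, omega)

# Marginal characteristic functions of a single fair 0/1 coin.
phi_1 = 0.5 + 0.5 * cmath.exp(1j * omega[0])
phi_2 = 0.5 + 0.5 * cmath.exp(1j * omega[1])

print(abs(phi - phi_1 * phi_2) < 1e-12)                     # True
print(abs(char_fn(outcomes, probs, (0.0, 0.0)) - 1) < 1e-12)  # True: phi(0) = 1
```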
Further properties
Expectation of a quadratic form
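One can take the expectation of a quadratic form in the random vector $\mathbf{X}$ as follows:^{[5]}^{:p.170–171}

$$\operatorname{E}[\mathbf{X}^T A \mathbf{X}] = \operatorname{E}[\mathbf{X}]^T A \operatorname{E}[\mathbf{X}] + \operatorname{tr}(A K_{\mathbf{X}\mathbf{X}}),$$

where $K_{\mathbf{X}\mathbf{X}}$ is the covariance matrix of $\mathbf{X}$ and $\operatorname{tr}$ refers to the trace of a matrix, that is, to the sum of the elements on its main diagonal (from upper left to lower right). Since the quadratic form is a scalar, so is its expectation.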
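Proof: Let $\mathbf{z}$ be an $m \times 1$ random vector with $\operatorname{E}[\mathbf{z}] = \mu$ and $\operatorname{Cov}[\mathbf{z}] = V$ and let $A$ be an $m \times m$ non-stochastic matrix.
Then based on the formula for the covariance, if we denote $\mathbf{z}^T = \mathbf{X}$ and $\mathbf{z}^T A^T = \mathbf{Y}$, we see that:

$$\operatorname{Cov}[\mathbf{X}, \mathbf{Y}] = \operatorname{E}[\mathbf{X}\mathbf{Y}^T] - \operatorname{E}[\mathbf{X}]\operatorname{E}[\mathbf{Y}]^T$$

Hence

$$\operatorname{E}[\mathbf{z}^T A \mathbf{z}] = \operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] + \operatorname{E}[\mathbf{z}^T]\operatorname{E}[\mathbf{z}^T A^T]^T = \operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] + \mu^T(\mu^T A^T)^T = \operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] + \mu^T A \mu,$$

which leaves us to show that

$$\operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] = \operatorname{tr}(AV).$$

This is true based on the fact that one can cyclically permute matrices when taking a trace without changing the end result (e.g.: $\operatorname{tr}(AB) = \operatorname{tr}(BA)$).
We see that

$$\operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] = \operatorname{E}\left[(\mathbf{z}^T - \operatorname{E}[\mathbf{z}^T])(\mathbf{z}^T A^T - \operatorname{E}[\mathbf{z}^T A^T])^T\right] = \operatorname{E}\left[(\mathbf{z}^T - \mu^T)(\mathbf{z}^T A^T - \mu^T A^T)^T\right] = \operatorname{E}\left[(\mathbf{z} - \mu)^T(A\mathbf{z} - A\mu)\right].$$

And since

$$(\mathbf{z} - \mu)^T(A\mathbf{z} - A\mu)$$

is a scalar, then

$$(\mathbf{z} - \mu)^T(A\mathbf{z} - A\mu) = \operatorname{tr}\left((\mathbf{z} - \mu)^T(A\mathbf{z} - A\mu)\right) = \operatorname{tr}\left((\mathbf{z} - \mu)^T A(\mathbf{z} - \mu)\right)$$

trivially. Using the permutation we get:

$$\operatorname{tr}\left((\mathbf{z} - \mu)^T A(\mathbf{z} - \mu)\right) = \operatorname{tr}\left(A(\mathbf{z} - \mu)(\mathbf{z} - \mu)^T\right),$$

and by plugging this into the original formula we get:

$$\operatorname{Cov}[\mathbf{z}^T, \mathbf{z}^T A^T] = \operatorname{E}\left[(\mathbf{z} - \mu)^T(A\mathbf{z} - A\mu)\right] = \operatorname{E}\left[\operatorname{tr}\left(A(\mathbf{z} - \mu)(\mathbf{z} - \mu)^T\right)\right] = \operatorname{tr}\left(A \operatorname{E}\left[(\mathbf{z} - \mu)(\mathbf{z} - \mu)^T\right]\right) = \operatorname{tr}(AV).$$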
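The identity E[X^T A X] = E[X]^T A E[X] + tr(A K), with K the covariance matrix of X, can be checked exactly on a small discrete distribution. The sketch below assumes a two-component random vector with four equally likely values and an arbitrary fixed matrix A; both are hypothetical choices made for illustration.

```python
outcomes = [(1, 2), (3, 0), (5, 4), (7, 2)]   # equally likely values of X
A = [[2, 1], [0, 3]]                          # fixed (non-random) matrix
n, d = len(outcomes), 2

def quad(x, M, y):
    """x^T M y for plain Python lists or tuples."""
    return sum(x[i] * M[i][j] * y[j] for i in range(d) for j in range(d))

# Mean vector and covariance matrix of X.
mu = [sum(x[k] for x in outcomes) / n for k in range(d)]
K = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in outcomes) / n
      for j in range(d)] for i in range(d)]

lhs = sum(quad(x, A, x) for x in outcomes) / n             # E[X^T A X]
trace_AK = sum(A[i][j] * K[j][i] for i in range(d) for j in range(d))
rhs = quad(mu, A, mu) + trace_AK                           # mu^T A mu + tr(A K)
print(lhs, rhs)  # 69.0 69.0
```

Note that A is deliberately not symmetric here; the identity does not require symmetry.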
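Expectation of the product of two different quadratic forms
One can take the expectation of the product of two different quadratic forms in a zero-mean Gaussian random vector $\mathbf{X}$ as follows (stated here for symmetric $A$ and $B$):^{[5]}^{:pp. 162–176}

$$\operatorname{E}\left[(\mathbf{X}^T A \mathbf{X})(\mathbf{X}^T B \mathbf{X})\right] = 2\operatorname{tr}(A K_{\mathbf{X}\mathbf{X}} B K_{\mathbf{X}\mathbf{X}}) + \operatorname{tr}(A K_{\mathbf{X}\mathbf{X}})\operatorname{tr}(B K_{\mathbf{X}\mathbf{X}}),$$

where again $K_{\mathbf{X}\mathbf{X}}$ is the covariance matrix of $\mathbf{X}$. Again, since both quadratic forms are scalars and hence their product is a scalar, the expectation of their product is also a scalar.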
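As a sanity check on the product formula E[(X^T A X)(X^T B X)] = 2 tr(A K B K) + tr(A K) tr(B K) for a zero-mean Gaussian vector with covariance K, consider the scalar case. The notation below (a, b for the 1×1 matrices and σ² for the variance) is introduced only for this illustration, and it relies on the standard fourth moment of a zero-mean normal variable.

```latex
% Scalar case: n = 1, X ~ N(0, sigma^2), A = a, B = b, K = sigma^2.
\begin{align*}
\operatorname{E}\left[(aX^2)(bX^2)\right]
  &= ab\,\operatorname{E}[X^4] = 3ab\,\sigma^4
  && \text{since } \operatorname{E}[X^4] = 3\sigma^4, \\
2\operatorname{tr}(AKBK) + \operatorname{tr}(AK)\operatorname{tr}(BK)
  &= 2ab\,\sigma^4 + (a\sigma^2)(b\sigma^2) = 3ab\,\sigma^4 .
\end{align*}
```

Both sides agree, as the general formula requires.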
Applications
Portfolio theory
In portfolio theory in finance, an objective often is to choose a portfolio of risky assets such that the distribution of the random portfolio return has desirable properties. For example, one might want to choose the portfolio return having the lowest variance for a given expected value. Here the random vector is the vector $\mathbf{r}$ of random returns on the individual assets, and the portfolio return $p$ (a random scalar) is the inner product of the vector of random returns with a vector $w$ of portfolio weights, that is, the fractions of the portfolio placed in the respective assets. Since $p = w^T\mathbf{r}$, the expected value of the portfolio return is $w^T\operatorname{E}(\mathbf{r})$ and the variance of the portfolio return can be shown to be $w^TCw$, where $C$ is the covariance matrix of $\mathbf{r}$.
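The expected return w^T E(r) and the variance w^T C w of a portfolio can be computed directly from the weights, the mean returns, and the covariance matrix of the asset returns. The following is a minimal sketch; the two-asset numbers and the function name `portfolio_stats` are made up for illustration and carry no financial meaning.

```python
def portfolio_stats(weights, mean_returns, cov):
    """Return (w^T E(r), w^T C w) for a portfolio with the given weights."""
    k = len(weights)
    expected = sum(weights[i] * mean_returns[i] for i in range(k))
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(k) for j in range(k))
    return expected, variance

# Hypothetical two-asset inputs, chosen only for illustration.
w = [0.5, 0.5]                 # portfolio weights, summing to 1
mu = [0.08, 0.04]              # expected returns E(r)
C = [[0.04, 0.01],
     [0.01, 0.02]]             # covariance matrix of the returns

expected, variance = portfolio_stats(w, mu, C)
print(expected, variance)      # approximately 0.06 and 0.02
```

With these inputs the portfolio variance is lower than the variance of the first asset alone, which illustrates why the covariance terms matter when choosing weights.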
Regression theory
In linear regression theory, we have data on $n$ observations on a dependent variable $y$ and $n$ observations on each of $k$ independent variables $x_j$. The observations on the dependent variable are stacked into a column vector $y$; the observations on each independent variable are also stacked into column vectors, and these latter column vectors are combined into a design matrix $X$ (not denoting a random vector in this context) of observations on the independent variables. Then the following regression equation is postulated as a description of the process that generated the data:

$$y = X\beta + e,$$

where $\beta$ is a postulated fixed but unknown vector of $k$ response coefficients, and $e$ is an unknown random vector reflecting random influences on the dependent variable. By some chosen technique such as ordinary least squares, a vector $\hat{\beta}$ is chosen as an estimate of $\beta$, and the estimate of the vector $e$, denoted $\hat{e}$, is computed as

$$\hat{e} = y - X\hat{\beta}.$$

Then the statistician must analyze the properties of $\hat{\beta}$ and $\hat{e}$, which are viewed as random vectors since a randomly different selection of $n$ cases to observe would have resulted in different values for them.
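To make the estimate of β and the residual vector concrete, the sketch below applies ordinary least squares to a design matrix with exactly two columns (an intercept and one regressor), solving the normal equations (X^T X) β = X^T y with a closed-form 2×2 inverse. The data and the function name `ols_two_column` are hypothetical; a general-purpose implementation would use a linear algebra library instead.

```python
def ols_two_column(X, y):
    """Least-squares estimate for a design matrix with exactly two columns.

    Solves the normal equations (X^T X) beta = X^T y with a closed-form
    2x2 inverse, then returns (beta_hat, residuals).
    """
    s00 = sum(r[0] * r[0] for r in X)
    s01 = sum(r[0] * r[1] for r in X)
    s11 = sum(r[1] * r[1] for r in X)
    t0 = sum(r[0] * yi for r, yi in zip(X, y))
    t1 = sum(r[1] * yi for r, yi in zip(X, y))
    det = s00 * s11 - s01 * s01
    if det == 0:
        raise ValueError("columns of X are linearly dependent")
    beta = ((s11 * t0 - s01 * t1) / det, (s00 * t1 - s01 * t0) / det)
    resid = [yi - (r[0] * beta[0] + r[1] * beta[1]) for r, yi in zip(X, y)]
    return beta, resid

# Hypothetical data: intercept column plus one regressor, n = 4, k = 2.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
y = [1.1, 2.9, 5.2, 6.8]
beta_hat, e_hat = ols_two_column(X, y)
print(beta_hat)   # approximately (1.09, 1.94)
print(e_hat)      # residuals; they sum to (numerically) zero
```

Because the design matrix contains an intercept column, the least-squares residuals are orthogonal to it, which is why they sum to zero up to rounding.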
Vector time series
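The evolution of a $k \times 1$ random vector $\mathbf{X}$ through time can be modelled as a vector autoregression (VAR) as follows:

$$\mathbf{X}_t = c + A_1\mathbf{X}_{t-1} + A_2\mathbf{X}_{t-2} + \cdots + A_p\mathbf{X}_{t-p} + \mathbf{e}_t,$$

where the i-periods-back vector observation $\mathbf{X}_{t-i}$ is called the i-th lag of $\mathbf{X}$, $c$ is a $k \times 1$ vector of constants (intercepts), $A_i$ is a time-invariant $k \times k$ matrix and $\mathbf{e}_t$ is a $k \times 1$ random vector of error terms.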
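A first-order vector autoregression, X_t = c + A X_{t−1} + e_t, can be simulated in a few lines. The sketch below assumes a single lag (p = 1), independent Gaussian error terms, and hypothetical parameter values for k = 2; none of these choices come from the text. With the noise switched off, the path should settle at the fixed point that solves x = c + A x.

```python
import random

def var1_step(c, A, x_prev, e):
    """One step of X_t = c + A X_{t-1} + e_t for a k-dimensional VAR(1)."""
    k = len(c)
    return [c[i] + sum(A[i][j] * x_prev[j] for j in range(k)) + e[i]
            for i in range(k)]

def simulate_var1(c, A, x0, steps, noise_sd, seed=0):
    """Simulate a path; errors are i.i.d. Gaussian here (an assumption)."""
    rng = random.Random(seed)
    path = [list(x0)]
    for _ in range(steps):
        e = [rng.gauss(0.0, noise_sd) for _ in c]
        path.append(var1_step(c, A, path[-1], e))
    return path

# Hypothetical k = 2 parameters; A is chosen so the process is stable.
c = [1.0, 0.5]
A = [[0.5, 0.1],
     [0.0, 0.3]]

noisy = simulate_var1(c, A, x0=[0.0, 0.0], steps=5, noise_sd=0.1)
quiet = simulate_var1(c, A, x0=[0.0, 0.0], steps=200, noise_sd=0.0)
print(len(noisy))    # 6: the initial value plus five simulated steps
print(quiet[-1])     # close to the fixed point (15/7, 5/7)
```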
References
1. Gallager, Robert G. (2013). Stochastic Processes: Theory for Applications. Cambridge University Press. ISBN 9781107039759.
2. Lapidoth, Amos (2009). A Foundation in Digital Communication. Cambridge University Press.
3. Gubner, John A. (2006). Probability and Random Processes for Electrical and Computer Engineers. Cambridge University Press. ISBN 9780521864701.
4. Papoulis, Athanasios (1991). Probability, Random Variables and Stochastic Processes. McGraw-Hill.
5. Kendrick, David (1981). Stochastic Control for Economic Models. McGraw-Hill.