In probability theory, especially in mathematical statistics, a location–scale family is a family of probability distributions parametrized by a location parameter and a nonnegative scale parameter. For any random variable $X$ whose probability distribution function belongs to such a family, the distribution function of $Y \stackrel{d}{=} a + bX$ also belongs to the family (where $\stackrel{d}{=}$ means "equal in distribution", that is, "has the same distribution as"). Moreover, if $X$ and $Y$ are two random variables whose distribution functions are members of the family, and assuming 1) existence of the first two moments and 2) $X$ has zero mean and unit variance, then $Y$ can be written as $Y \stackrel{d}{=} \mu_Y + \sigma_Y X$, where $\mu_Y$ and $\sigma_Y$ are the mean and standard deviation of $Y$.
In other words, a class $\Omega$ of probability distributions is a location–scale family if for all cumulative distribution functions $F \in \Omega$ and any real numbers $a \in \mathbb{R}$ and $b > 0$, the distribution function $G(x) = F(a + bx)$ is also a member of $\Omega$.
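 If $X$ has a cumulative distribution function $F_X(x) = P(X \le x)$, then $Y = a + bX$ has a cumulative distribution function $F_Y(y) = F_X\left(\frac{y - a}{b}\right)$.
 If $X$ is a discrete random variable with probability mass function $p_X(x) = P(X = x)$, then $Y = a + bX$ is a discrete random variable with probability mass function $p_Y(y) = p_X\left(\frac{y - a}{b}\right)$.
 If $X$ is a continuous random variable with probability density function $f_X(x)$, then $Y = a + bX$ is a continuous random variable with probability density function $f_Y(y) = \frac{1}{b} f_X\left(\frac{y - a}{b}\right)$.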
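As a rough numerical illustration of the transformation rules for the CDF and PDF, the following Python sketch takes the standard normal as the base distribution. That choice, the values of `a` and `b`, and all helper names (`std_normal_pdf`, `std_normal_cdf`, `shifted_scaled_cdf`, `shifted_scaled_pdf`) are assumptions made only for this example. It builds the CDF and PDF of $Y = a + bX$ with $b > 0$ and checks that the derivative of the transformed CDF matches the transformed PDF.

```python
import math

def std_normal_pdf(x):
    """Density of the standard normal, used here as the base distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """CDF of the standard normal, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def shifted_scaled_cdf(cdf_x, a, b):
    """CDF of Y = a + b*X for b > 0: F_Y(y) = F_X((y - a) / b)."""
    if b <= 0:
        raise ValueError("scale b must be positive")
    return lambda y: cdf_x((y - a) / b)

def shifted_scaled_pdf(pdf_x, a, b):
    """PDF of Y = a + b*X for b > 0: f_Y(y) = f_X((y - a) / b) / b."""
    if b <= 0:
        raise ValueError("scale b must be positive")
    return lambda y: pdf_x((y - a) / b) / b

a, b = 3.0, 2.0  # illustrative location and scale
cdf_y = shifted_scaled_cdf(std_normal_cdf, a, b)
pdf_y = shifted_scaled_pdf(std_normal_pdf, a, b)

# Sanity check: a central difference of the CDF should approximate the PDF.
y, h = 4.0, 1e-5
numeric_derivative = (cdf_y(y + h) - cdf_y(y - h)) / (2.0 * h)
print(abs(numeric_derivative - pdf_y(y)) < 1e-8)
```

The factor `1 / b` in the density is what keeps the total area equal to 1 after the horizontal stretch; the CDF and PMF need no such factor.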
In decision theory, if all alternative distributions available to a decision-maker are in the same location–scale family, and the first two moments are finite, then a two-moment decision model can apply, and decision-making can be framed in terms of the means and the variances of the distributions.^{[1]}^{[2]}^{[3]}

Lecture 14: Location, Scale, and LOTUS | Statistics 110

Transcription
So last time we were talking about the standard normal, right? Normal zero one. So just a few quick facts that we proved last time. So our notation is, traditionally it's often called Z, but I'm not saying Z has to be standard normal, or that you have to call a standard normal Z; we just often use the letter Z for that. If Z is standard normal, then first of all, we found its PDF, right? We figured out the normalizing constant. Its CDF you can't actually do in closed form, so it's just called capital Phi. That's just the standard notation for the CDF. We computed the mean and the variance last time. Remember, the mean, E of Z, equals 0. That's just immediate by symmetry. Then we also did the variance. In this case the variance is E of Z squared, which equals 1, cuz the variance is E of Z squared minus the square of E of Z, but that's 0, so that's 1. That we computed last time using integration by parts. And by the way, these are called the first moment and the second moment. If we wanted E of Z cubed, this we didn't talk about last time. That's gonna be 0 again. I'll just write down what it would be. By LOTUS, we would have the integral from minus infinity to infinity of 1 over root 2 pi, e to the minus z squared over 2, dz. This integrates to 1; that would be just integrating the PDF. And LOTUS says if we want E of Z cubed, we just stick in a z cubed here. If we just wanted E of Z, we'd put z; if we want E of Z cubed, we'd put z cubed. That's LOTUS. But this is just equal to 0, because this is an odd function, again. So we talked about that in the case of the mean, but the same argument applies here for z cubed. Similarly, for any odd power here, 5, 7, and so on, we'll immediately get 0. So this is called the third moment. At some point later in the semester we can talk about where the word moment comes from. But that's just terminology: that's called the third moment.
E of Z squared would be called the second moment, E of Z the first moment, and so on. Okay, so in other words, by symmetry we already know that all the odd moments of the normal are 0. The even moments: well, we have the second one. If we wanted E of Z to the fourth, it's going to be the same integral except with z to the fourth instead of z cubed, and that's not such an easy integral anymore, okay? And it's not an integral that you need to know how to do at this point. We'll probably come back to how to do things like that later, not before the midterm though. But at least you should immediately know from LOTUS that you could write down the integral for E of Z to the fourth; it just happens to be an integral that I don't expect anyone could do right now. Odd moments, though, you just immediately get 0 by symmetry, no integrals needed. Okay, so I was talking about symmetry. Let me just mention symmetry one other way, which is that minus Z is also standard normal. And that's just another way to express the symmetry of it. That is, the PDF is this bell curve that's symmetric about 0. Just flipping the sign changes the random variable: it makes a positive into a negative and a negative into a positive. But it does not change the distribution; that's what the symmetry says. So you can either just see this by symmetry, or you could compute the PDF of minus Z by first finding the CDF, then finding the PDF, and you'll see that that's true. That's a very useful fact; it's always useful to look for symmetries. Okay, so this is just stuff about the standard normal. But now we wanna introduce what happens with a normal that is not necessarily normal 0, 1, okay? So this is the general normal. We let X equal mu plus sigma Z, where mu is any real number, and we would call that the mean cuz that's going to be the mean.
But we would also call that the location. Because we're just adding a constant, it means a shift in location. We're not changing what the density looks like by adding mu, we're just moving it around left and right. And sigma is any positive number. Mu could be negative, sigma has to be positive, and that's called the standard deviation. Remember, standard deviation we defined as the square root of the variance. So sigma is the standard deviation, but we also call that the scale because we're just rescaling everything by multiplying by a constant. So if you draw the density, that's gonna affect how wide or how narrow that curve is. It still has to integrate to 1, so you can't just make it really big and wide and suddenly make the area blow up. You also have to make sure that you multiply by a normalizing constant so it still integrates to 1, but you can still make it wider or narrower. Okay, then we say X is normal with mean mu and variance sigma squared. So those are the two parameters. Most books would do this a little bit differently and start by writing down the PDF of this. But this is a more useful and more insightful way to think about it, where we're saying there's just one fundamental basic normal distribution. That's what we call the standard normal. Once we understand the standard normal, we can easily get any other normal distribution we want just by multiplying by a constant and adding a constant. So it's reducing everything back down to the standard normal. It's really useful to always keep that in mind instead of just looking at ugly formulas, okay? So let's actually check that this has the desired mean and variance. The expected value of X, just by linearity, is mu plus sigma times the expected value of Z, and the expected value of Z is 0, so that's just mu, immediate from this. For the variance, we need to talk a little bit more about the properties of variance.
So I'll come back to this in a minute. First, let's just talk a little bit more in general about variance. We did a quick introduction to variance before, but we should go a little bit further. So remember, there are two ways to write variance. The definition is to subtract off the mean and square it: the average squared distance of X from its mean. But we also showed that it can be written as E of X squared minus the square of E of X, okay? Now in particular, if we had the variance of X plus a constant, intuitively, if we just add a constant we're not changing how variable X is, right? So intuitively that should be the same as the variance of X. And you can see that immediately from the first formula, because if you replace x by x + c, the mean also shifts by c by linearity, and you get the exact same thing. So adding a constant has no effect on the variance. Now if we multiply by a constant, then from either of these formulas, just imagine sticking in a c here and a c here. The c comes out because of linearity again, but then it's squared. So the variance of c times X is c squared times the variance of X. A common mistake is to forget the square here, but that really messes things up; the constant comes out of the variance squared. And an easy way to see that is that if c is negative, this is still valid. But if you forgot to write the square here, you would get a negative variance. If you ever get a negative variance, that's very, very bad; variance cannot be negative. So anytime you compute a variance, the first thing you should check is, is the thing I wrote down at least nonnegative? So variance is always greater than or equal to 0, and the only case where it could be 0 is if X is a constant. That is, the variance of X equals 0 if and only if X is a constant with probability 1: P of X = a equals 1 for some a. That is, with probability 0, something bad could happen. But with probability 1, it always equals this constant a.
So that would have variance 0 because the stuff with probability 0 doesn't affect anything, so essentially it's a constant. If it's not a constant, the variance will be strictly positive. Okay, so that's the variance of a constant times X. We'll do a lot more with variance after the midterm, but one other fact to point out for now is that variance, unlike expected value, is not linear. So the variance of X + Y is not equal to the variance of X plus the variance of Y in general. It may be equal, but it's not necessarily equal. So actually, it violates both of the linearity properties. If it were linear, we would want constants to come out as themselves, and here they come out squared. And we can't say the variance of the sum is the sum of the variances. It is equal if X and Y are independent; we're not gonna show this until sometime after the midterm. But remember, linearity holds regardless of whether the random variables are independent or not. So if they're independent, it will be equal, we'll show that later, but in general, they're not equal. And one quick example of that would be, what if we look at the variance of X + X? All right, that's an extreme case of dependence; it's actually the same thing, right? Well, the variance of X + X is the variance of 2X, which we just said is 4 times the variance of X. If the variance of a sum were always the sum of the variances, we would get 2 times the variance of X. And this says we get 4 times the variability, not 2 times the variability. That's just a simple example, but it's also a common mistake that I've seen before. In the past I've asked questions on homeworks or exams where we have something like 2X. And a lot of students took the approach of, well, 2X is X + X. Of course, that's valid, but then at that point, they made the mistake of replacing X + X by, let's say, X1 + X2.
Where those are IID, with the same distribution as X. That's completely wrong because X is not independent of itself. It's extremely dependent, and if you somehow replace it by independent copies, then it doesn't work. So I'm telling you to be careful of this, just keeping track of dependence versus independence. Here they're extremely dependent, and so that's why we got this 4 here. And I think, intuitively, that should make some sense, right? If this were X1 and X2 and they're independent, then the variabilities just add. Here, they're exactly the same, so that magnifies the variability, okay. So that's a few quick notes about variance. Now, coming back to this for the normal case: we just saw that adding mu does nothing to the variance, and multiplying by sigma, it comes out as sigma squared. That's sigma squared times the variance of Z, which is just sigma squared. Okay, so that confirms that when we write normal mu, sigma squared, this is the mean and this is the variance. So those are the two parameters of the normal distribution. And whenever you have a normal distribution, you should always think about reducing it back to the standard normal. So we could also go the other way around, and I don't need much space for this, because I'm just gonna solve this equation for Z. Z equals X minus mu over sigma, very easy algebra. That's called standardization. So standardization says, I'm just going the other direction here. I was starting with the standard normal, and we can construct a general normal this way. Now what if we wanted to go the other way? We start with X, which is normal mu, sigma squared. Subtract the mean, divide by the standard deviation, and that will always give us a standard normal. So that process is called standardization. It's very, very useful and it's simple, right: just subtract the mean and divide by the standard deviation.
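The variance facts and the standardization step described above can be checked exactly on a small discrete distribution. The sketch below is only an illustration: the PMF is a hypothetical example (not from the lecture) chosen so that the standard deviation is exactly 2, and the helper names `mean`, `variance`, `transform`, and `sum_of_independent` are made up for this example.

```python
from fractions import Fraction
from itertools import product

# Hypothetical PMF chosen so that Var(X) = 4 and the standard deviation is exactly 2.
pmf_x = {0: Fraction(1, 8), 4: Fraction(3, 4), 8: Fraction(1, 8)}

def mean(pmf):
    """E(X) for a finite PMF given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E((X - E X)^2)."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def transform(pmf, g):
    """PMF of g(X), merging the probabilities of values that collide."""
    out = {}
    for x, p in pmf.items():
        out[g(x)] = out.get(g(x), 0) + p
    return out

def sum_of_independent(pmf_a, pmf_b):
    """PMF of A + B when A and B are independent."""
    out = {}
    for (x, p), (y, q) in product(pmf_a.items(), pmf_b.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out

mu = mean(pmf_x)
var_x = variance(pmf_x)
sigma = Fraction(2)  # exact square root of var_x for this particular PMF

var_shifted = variance(transform(pmf_x, lambda x: x + 5))    # Var(X + c)
var_doubled = variance(transform(pmf_x, lambda x: 2 * x))    # Var(X + X) = Var(2X)
var_negated = variance(transform(pmf_x, lambda x: -3 * x))   # negative constant
var_indep_sum = variance(sum_of_independent(pmf_x, pmf_x))   # Var(X1 + X2), independent copies
standardized = transform(pmf_x, lambda x: (x - mu) / sigma)  # (X - mu) / sigma

print(var_x, var_shifted, var_doubled, var_negated, var_indep_sum)  # 4 4 16 36 8
print(mean(standardized), variance(standardized))                   # 0 1
```

The doubled variable has four times the variance while the sum of two independent copies has only twice the variance, which is the X + X versus X1 + X2 distinction. Standardizing this non-normal X gives mean 0 and variance 1, but of course not a standard normal.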
And yet sometimes students get confused about it, or divide by the variance instead of dividing by the standard deviation, or just don't think to do it in the first place. So that's why I'm emphasizing it; it's a simple but useful transformation. Okay, so as a quick example of how we use that, let's derive the PDF of the general normal. Find the PDF of normal mu, sigma squared. Well, one way to find it is to look it up in a book. But that doesn't tell you anything; that's just a formula in a book. So what we want to understand is, assuming that we already know the PDF of the standard normal, how can we get the PDF of the nonstandard normal in a way that's easy, without having to memorize stuff? Okay, so let's call this X again, and let's find the CDF first. This is just good practice with CDFs. Everyone here should make sure that you're good at CDFs and PDFs and PMFs. And that just takes practice, so this is some simple practice with that. By definition, the CDF is P of X less than or equal to lowercase x, and now I just told you that a useful trick is to standardize, so let's standardize this. It's the same thing as saying X minus mu over sigma is less than or equal to lowercase x minus mu over sigma, right? Sigma is positive, so it doesn't flip the inequality to do that. The reason I standardized it was because now this thing on the left is standard normal. So by definition, this is just the CDF of the standard normal evaluated here: capital Phi of x minus mu over sigma. Now, to get the PDF, we just have to take the derivative of the CDF. That's just the chain rule, right, because this capital Phi is the outer function and then we have this inner function here, so it's just the chain rule from basic calculus. It's the derivative of the outer function evaluated here, times the derivative of the inner function.
The derivative of this inner function is just 1 over sigma, right, cuz it's 1 over sigma times x, minus a constant. So we are gonna get a 1 over sigma, and then the derivative of capital Phi is just the standard normal PDF, right? And it says evaluated here, so I'm just gonna write down the standard normal PDF, and I'm gonna evaluate it at x minus mu over sigma. And that's it, we're done. So it should be a very, very quick calculation to be able to do that. And as another quick example: over here in the corner, we said that if Z is standard normal, then minus Z is also standard normal. Let's also ask the question of what happens to minus X. Well, you could work through a similar calculation, but I think the neatest way to think of it is, we're thinking of X as mu + sigma Z. So minus X is minus mu + sigma times minus Z. But minus Z is standard normal. So this is just of the form some location constant plus sigma times a standard normal. So we immediately know that's normal minus mu, sigma squared. Which again makes sense intuitively: because we put a minus sign on X, we put a minus sign on the mean. We do not put a minus sign on the variance, because variance can't be negative, so the variance stays sigma squared. So you could do a calculation for this, but this is just immediate from thinking of X in terms of the standard normal. So this is the easiest way to do this, okay? And here's a useful fact just to know, but we'll prove it much later in the course. Later we'll show that if Xj is normal mu j, sigma j squared, for j equals 1, 2, and they're independent, then X1 + X2 is normal with mean mu1 + mu2 and variance sigma1 squared + sigma2 squared. So that's something we need to prove, and we'll do that much later. The sum of independent normals is normal, but the reason I'm mentioning it now is just to think about what happens to the mean and variance. By linearity, we know that the mean would have to be mu1 + mu2. For the variance, this is something else we'll prove later.
In the independent case we can just add up the variances, so it's just sigma1 squared + sigma2 squared. Now what if we looked at X1 - X2? The mean is mu1 - mu2; that's just linearity again. I'm mentioning this now because I can't even count the number of times I've seen students write that the variance is sigma1 squared - sigma2 squared. Well, first of all, that could be negative, so that doesn't make any sense. And secondly, any time you see a subtraction you can really think of that as adding the negative of something, right? So this is X1 plus minus X2. And minus X2 still has variance sigma2 squared, so the variances still add. That's just a useful fact to keep in mind; we'll prove it later. But I'm mainly talking right now just in terms of what happens to the mean and variance. Later we'll see why the sum and difference are still normal. That's just one very useful property of the normal. It lets us do a lot of things without leaving the realm of normality, right? If you added two of them and it somehow became some completely different distribution, it would be hard to work with. So that's a very nice property of the normal. Okay, one other fact about the normal that's just like a rule of thumb. Because you can't actually compute this function, capital Phi, other than by having a table of values or a computer or calculator that specifically knows how to do that function, and you can't write it in terms of other standard functions, it's useful to just have a few quick rules of thumb. So there's something called the 68-95-99.7% rule. And I don't know who named it that, but the first time I heard of it I thought that's the stupidest name for a rule that I have ever heard of. However, I then always remembered it, so actually it works very well.
It simply says, it's just three simple numbers telling us how likely it is that a normal random variable will be within a certain distance of its mean, measured in terms of standard deviations. So if X is normal, consider the probability that X is more than 1 standard deviation from its mean. Notationally we would just write it like that. Intuitively, that's just saying, what's the chance that the distance from the mean is more than 1 standard deviation? Well, I'll actually write it the other way. The probability that X is within 1 standard deviation of its mean is about 68%. The chance that X is within 2 standard deviations of its mean is about 95%. And the chance that it's within 3 standard deviations is about 99.7%. So, in other words, it's very common for people in practice to add and subtract 2 standard deviations. What that's saying is that, for the normal, that interval is gonna have about a 95% chance. So let's say you got a bunch of observations from this distribution independently. We would expect about 95% of them to be within 2 standard deviations of the mean, and 99.7% within 3. You can convert these statements into statements about capital Phi, which is good practice for making sure you understand what capital Phi is. But basically, this is just a few values of capital Phi written in a more intuitive way. Okay, so that's all for the normal distribution. So the main thing left to talk more about is LOTUS, with a couple of examples of using LOTUS to compute variances. For example, we proved that the mean of a Poisson lambda is lambda. We have not yet derived the variance of a Poisson lambda. So that's definitely something we should do. So, okay, let's do the variance of the Poisson.
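Before moving on, the 68-95-99.7% rule quoted above can be checked numerically. This is a minimal sketch, assuming capital Phi is evaluated through Python's `math.erf`; the function names `std_normal_cdf` and `prob_within` are hypothetical. By standardization, the probability that a normal X is within k standard deviations of its mean is Phi(k) minus Phi(minus k), whatever mu and sigma are.

```python
import math

def std_normal_cdf(z):
    """Capital Phi, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_within(k):
    """P(|X - mu| <= k * sigma) for any normal X, which equals Phi(k) - Phi(-k)."""
    return std_normal_cdf(k) - std_normal_cdf(-k)

for k in (1, 2, 3):
    # Prints values close to 0.68, 0.95, and 0.997 respectively.
    print(k, round(prob_within(k), 4))
```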
And that will also give us a chance to understand more about what's really going on with LOTUS. Why does LOTUS really work? So suppose we had a random variable such as the Poisson, but right now I'm just thinking in general: a random variable whose possible values are 0, 1, 2, 3, and so on. So let's call our random variable X. And X can be 0, 1, 2, 3, etc., okay? Now take its PMF. To say what the PMF is, I just need to say what's the probability of 0, let's call that p0, the probability of 1, p1, then p2, p3, and so on. So all I did here was write out the PMF, just stringing it out as a sequence, right? That's just specifying the PMF, and I'm calling pj the probability that X equals j. Now to figure out the variance we need to study X squared, right? So let's look at X squared. So 0 squared is 0, 1 squared is 1, 2 squared is 4, 3 squared is 9, and we keep going like that. From this point of view, it should be easy to see what we should do. Remember, for a discrete random variable, E(X) is the sum of x times the PMF. Now here we want E(X squared), but notice that the probability that X squared equals, say, 3 squared is just the probability p3 of being in this column here, right? So the probabilities didn't change, and we can just use the sum of x squared times the probability that X = x, right? Because X squared takes on these possible values with these same probabilities. That's what LOTUS is saying, so it's pretty intuitive in that sense. The case that you have to think more about is the case where the function is not one-to-one. Now, squaring is not one-to-one in general. If I had had negative numbers, then you would have duplicates here and you would have to sort that out. What LOTUS says is that even when you have those duplications, this still works. That, I think, is a little less obvious. If you think about it you can see why it's true, but it's not completely obvious.
In this case, because the values are nonnegative anyway, this is one-to-one and then it's just immediately true, okay? But LOTUS is saying that no matter how complicated your function is, something of this flavor still works, regardless of whether you have duplications. So now we're ready to get the Poisson variance. So that was just in general, if you have a random variable with nonnegative integer values. Now let's look at the specific case of Poisson lambda, and we want to find E(X squared). According to LOTUS we can just write that as the sum from k = 0 to infinity of k squared times e to the minus lambda, lambda to the k over k factorial; that's the PMF. So we have to figure out how to do this sum, and this looks like a pretty unfamiliar sum. I mean, my first thought when I see this would be, well, this is k times k, and we can cancel and get a k - 1 factorial here. And there's nothing wrong with doing that, but it's still kind of annoying because we still have k squared up here. When we were just finding the mean, we just had a k and we cancelled it and things were nice. But now we have a k squared; it's more annoying, okay? So here's another method for dealing with something like that. The general method is to start with what we know, right? So what we know how to do is the Taylor series for e to the x. Hopefully you all know that by now; we keep using it over and over again. I'll write it in terms of lambda. The sum from k = 0 to infinity of lambda to the k over k factorial is e to the lambda, and this is valid for all real lambda, even for imaginary numbers, complex numbers; this always converges. Now if I wanna get a k in front, then a natural strategy would be to take the derivative of both sides. Well, that's pretty nice, right, because the derivative of e to the lambda is e to the lambda. For the derivative of the left-hand side, I'll start the sum at 1 now because the k = 0 term is 0. So we have the sum of k lambda to the k - 1 over k factorial. I just took the derivative of both sides.
I exchanged the derivative and the sum, which is valid under some mild technical conditions. Now we're getting closer, but we still only have a k, not a k squared, okay? So my first impulse would be to take a derivative again. That's slightly annoying cuz then I'd get a k - 1 coming down, and I want a k, not a k - 1. So to fix that, all we have to do is multiply both sides by lambda, okay? So, just put a lambda on both sides. I call that replenishing the lambdas. We just replenish it, so that we have a lambda to the k there. I'll write it again: the sum from k = 1 to infinity of k lambda to the k over k factorial equals lambda e to the lambda. We've replenished our supply of lambdas; now we can take the derivative again and we have what we want. Okay, so I take the derivative a second time, and it's the sum from k = 1 to infinity, and now it's k squared lambda to the k - 1 over k factorial. On the right we have to use the product rule: the derivative of lambda e to the lambda is lambda e to the lambda plus e to the lambda, which we can factor as e to the lambda times lambda + 1. Okay, well, that's almost exactly the sum that we needed. This e to the minus lambda comes out, so this is e to the minus lambda, e to the lambda, lambda + 1. I'm missing something; is there another lambda somewhere? Let's see, we have to replenish it again. Just put a lambda. Here we have lambda to the k - 1, and there we want lambda to the k. So replenish again, and then there is another lambda there, okay? I'm just bringing this lambda to the k - 1 back to being lambda to the k, right? So E(X squared) is e to the minus lambda times lambda e to the lambda times lambda + 1, and that's just lambda squared + lambda. And now we have the variance. The variance of X equals this thing, lambda squared plus lambda, minus the square of the mean, which is lambda squared, so the variance equals lambda. So this course is not really about memorizing formulas, but that's one that's very easy and useful to remember. The Poisson lambda has mean lambda and variance lambda. So that's kind of a strange property if you think about it.
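The result of this derivation, E(X squared) = lambda squared + lambda and hence variance lambda, can be sanity-checked by evaluating the LOTUS sum numerically. The sketch below is an illustration under stated assumptions: the infinite sum is truncated at 100 terms, lambda = 3.5 is an arbitrary example value, and the helper names `poisson_pmf_values` and `lotus_poisson` are hypothetical. The PMF is built with the recurrence p_k = p_(k-1) times lambda over k so that no huge factorials are formed.

```python
import math

def poisson_pmf_values(lam, terms):
    """Yield P(X = k) for k = 0, ..., terms - 1 via p_k = p_(k-1) * lam / k."""
    p = math.exp(-lam)  # P(X = 0)
    for k in range(terms):
        yield p
        p *= lam / (k + 1)

def lotus_poisson(g, lam, terms=100):
    """Truncated LOTUS sum: sum over k of g(k) * P(X = k)."""
    return sum(g(k) * p for k, p in enumerate(poisson_pmf_values(lam, terms)))

lam = 3.5  # arbitrary example rate
first_moment = lotus_poisson(lambda k: k, lam)
second_moment = lotus_poisson(lambda k: k * k, lam)
var = second_moment - first_moment ** 2

# Compare with the closed forms lam, lam**2 + lam, and lam.
print(first_moment, second_moment, var)
```

For a rate this small the truncated tail is negligible, so the three printed numbers should agree with 3.5, 15.75, and 3.5 up to floating-point rounding.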
That the mean equals the variance is a little bit strange. Maybe it would seem more natural if the mean equaled the standard deviation or something like that, because then those are on the same scale. But the Poisson doesn't actually have units. Poisson is just counting numbers of things, so it doesn't have that kind of dimensional interpretation. So, yeah, I wanted to also mention that about standardization as well. Another reason standardization is really nice to work with in the normal is, if you think of a normal as being a continuous measurement in some unit, it could be a unit of length, time, mass, whatever. If X is measured in whatever unit you want, let's say it's time measured in seconds, then X minus mu over sigma is seconds minus seconds divided by seconds, and the seconds cancel out. That means this is a dimensionless quantity, which is part of what makes standardization more directly interpretable, instead of having to worry about whether you measured it in seconds or years. So if we started with one measurement in seconds and one measurement in years, the same measurement in different units, and standardized both of them, we'd get the same thing. So that's a nice property of that. Okay, so that's the variance of the Poisson. We haven't yet gotten the variance of the binomial, so I'd like to do that. There's an easy way and a hard way. Well, actually, sorry, there are three ways to do it. There's a really easy way that we can't do yet because we haven't proven the necessary fact. There's an easy way that we can do, so that's what I'm gonna do. And then there's an annoying way, which we're not gonna do. The annoying but direct way: we want the variance of a binomial. The most direct, obvious way to do this would be to use LOTUS to get E(X squared), which would mean you would have to write down something like this, except here we wrote the Poisson PMF.
Instead you'd have to write n choose k, p to the k, whatever, the binomial PMF, right? And then you'd have to do that sum. And you can do it, but that's pretty tedious. You have to figure out how to do that sum and do a lot of algebra. Okay, so that's the way I don't wanna do it. The easiest way to do it would be using this fact here, which is that the variance of a sum of independent things is the sum of the variances. That's if they're independent, okay. We haven't proven this yet, so it's not valid to do it this way right now; I'm just kinda foreshadowing. We've emphasized the fact that we can think of a binomial as the sum of n independent Bernoulli p's. So once we prove this fact, it's applicable. So all we have to do is get the variance of a Bernoulli p, which is a really easy calculation cuz the Bernoulli is just 0 or 1, and multiply by n. That's the neatest way to do it. You can do it that way in your head once we get to that point, okay. Now here's kind of the compromise method, which is also just good practice with other concepts we've done, especially indicator random variables. So I'm still going to use the same idea of representing X as a sum of IID Bernoulli p's. I'll write them as I1 plus blah, blah, blah, plus In, just to emphasize the fact that they're indicators, I for indicator, where the Ij's are IID Bernoulli p, right? We've been doing this many times already. Ij is just the indicator of success on the jth trial; add up those and we get a binomial. Okay, so now we want the expected value of X squared. Let's actually not do the expected value yet. We'll just square this thing and then take the expected value. Well, you know, you do I1 squared and just square all the things, right?
So it's I1 squared plus blah, blah, blah, plus In squared, but as you know we also get a lot of cross terms, right? You're imagining this big thing times itself, so you get every possible cross term, each one twice: you have 2 I1 I2 and 2 I1 I3 and so on. All possible cross terms, and each cross term has a 2 in front. Just like when you square x + y, you get x squared + y squared + 2xy. It doesn't matter what order we write them in. Maybe we've ordered them in this way, so that's the last one. Okay, so it's all the cross terms. That looks pretty complicated. But it's actually much simpler than it looks. Now let's take the expected value of both sides and use linearity. Again, this is a good review example as well. We're using the same tricks: symmetry, indicator random variables, and linearity. These are IID, so by symmetry, the first part is just n times any one of them. So let's just say n E(I1 squared). That's just immediate by symmetry, right? So we don't have to write that big sum, just n times one of them. And now let's just count how many of these cross terms there are. Well, there are n choose 2 cross terms, right, because for any pair of subscripts we have a cross term. So it's really just 2 (n choose 2), and then just take one of them for concreteness, let's say E(I1 I2). Now, this is definitely looking better. But it's even better than it looks, because I1 is either 1 or 0. If you square 1 you get 1; if you square 0 you get 0. So I1 squared is just I1. So that's E(I1), and the expected value of a Bernoulli p is p. So the first term is just np. And n choose 2 is n times n - 1 over 2, so the 2s cancel, and the coefficient here is really just n(n - 1). Now let's think about this indicator random variable. Well, I called it an indicator random variable, but actually it's a product of indicator random variables. But a product of indicator random variables is itself an indicator random variable.
This thing here is the indicator of success on both the first and the second trial, right. Because if you think of multiplying two numbers that are each zero or one, you get zero if at least one of them is zero, and you get one if they're both one. So that's the indicator of success on both. So it's a product but it's actually just one indicator: success on both trials, number 1 and 2. So its expected value is just the probability of that happening. That probability of success on both the first trial and the second trial, because the trials are independent, is just p squared. Okay, so what we just computed is the second moment of the binomial. That's np + n(n - 1) p squared; if we multiply that out, np + n squared p squared - n p squared, right. Now to get the variance all we have to do is subtract the square of the mean, okay. So we showed before that a Binomial(n, p) has mean n times p. So if we square that, that's this term n squared p squared, so that cancels. So we're just canceling out this middle term and we just have np - n p squared = np(1 - p), which we would often write as npq with q = 1 - p. So the binomial variance is npq. So that's just a good review of indicator random variables and all of that stuff. So now we know the variance of the Poisson, the normal, the uniform, the binomial. For the geometric, it's kind of a similar calculation; we did the mean of the geometric in two different ways. The flavor of the calculation is similar to this, except we have a geometric series instead of the Taylor series for e to the x. So I don't think it's worth doing that in class. So in general, hypergeometric, let's talk a little bit about hypergeometric, that's pretty nasty. In the sense that in the hypergeometric, we could write it as a sum of indicator random variables. We're imagining we're drawing balls one at a time, or picking elk one at a time, and success is getting a tagged elk. But the problem is that they're not independent.
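As a quick numerical check of the binomial calculation above, here is a minimal Python sketch that compares the "tedious" sum over the binomial PMF with the indicator result np + n(n - 1) p squared. The function names and the values n = 10, p = 3/10 are illustrative assumptions, not from the lecture; exact fractions are used so the two sides can be compared with equality.

```python
from fractions import Fraction
from math import comb


def binomial_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p); p is a Fraction for exact arithmetic."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def second_moment_by_pmf(n, p):
    """E(X^2) computed the tedious way: sum k^2 * P(X = k) over all k."""
    return sum(k**2 * binomial_pmf(n, p, k) for k in range(n + 1))


def second_moment_by_indicators(n, p):
    """E(X^2) from the indicator expansion.

    n squared terms with E(I_j^2) = p, plus 2 * C(n, 2) = n(n - 1)
    cross terms with E(I_i I_j) = p^2 (independent trials).
    """
    return n * p + n * (n - 1) * p**2


# Illustrative values (assumed for this sketch).
n, p = 10, Fraction(3, 10)
mean = n * p
variance = second_moment_by_indicators(n, p) - mean**2

print(second_moment_by_pmf(n, p) == second_moment_by_indicators(n, p))
print(variance == n * p * (1 - p))  # npq
```

Both comparisons hold exactly for these values, which matches the npq formula derived above.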
So as far as the mean is concerned, we still use linearity. For the variance it's more complicated. So we'll worry about the variance of a hypergeometric after the midterm. That's more complicated. But for the binomial this is really, well, actually we could still. Here I didn't actually use the fact that they're independent, cuz I was just using linearity. So you could use a similar approach, so actually you could do it this way, but it would be too tedious to do it like on a midterm or something. But you could square it; if these are dependent, well, you can still work out the probability that the first two elk that you pick are both tagged. You could do that without too much trouble. But it's pretty messy looking. All right, so that's variance, and I guess the last thing to do is just to explain more about why LOTUS is true. And the basic proof of that is actually kind of conceptually similar to how we proved linearity. So we're trying to prove LOTUS, and I'm only gonna prove it for a discrete sample space. That's the case where I'm imagining finitely many pebbles. In the general case the ideas are not essentially different. It's just that we kind of need to write down some fancier integrals and use more technical math, but the concept is similar. So this is enough to give you the idea. So for a discrete sample space, the statement, and that's all we are trying to show, is that E(g(X)) can be written as the sum over x of g(x) P(X = x). So right, we can use the PMF of X; we do not have to first work on figuring out the distribution of g(X). That's all we are trying to do, so let's think about it. Let's think about it as a sum of, sum over the other. We could sum the other way around, sorry. Let me say this a different way; let me remind you of the identity that we used for proving linearity. That was this grouped versus ungrouped thing. So what we have is two different ways to write a certain sum.
We could either write this thing, the sum over x of g(x) P(X = x), or we could write it the other way, which is a sum over all s. Each s, we're thinking of that as s in the sample space S. So each little s is a pebble. And if we're summing it up pebble by pebble, then what we're doing is, remember, random variables are functions. So g(X) just means we apply the function X, then apply the function g. So we're just computing g(X(s)), that's just the definition, times P(s), the mass of that pebble. So if you stare at this equation long enough, and we have five minutes left to stare at that equation, so that's plenty of time, this is why LOTUS is true. It's just a matter of understanding this equation. So I'm gonna talk a little more about how you make sense of this equation. This is the grouped case. This is the ungrouped case. Remember I talked about pebbles and super pebbles. Ungrouped: this says take each pebble, compute this function, g of X of s, and you take a weighted average. Those are the weights. This says, first combine all of the pebbles that have the same value of X into, you know, super pebbles. A super pebble means we grouped together all pebbles with the same X value, not the same g(X) value, the same X value. Group those together, then average, and you get the same thing. So if I want to write that out in a little bit more detail, one way to think of it is as a double sum, right? Because we could imagine first summing over x. I'm gonna break this sum up. What I just explained to you was the intuition for why this is equal to this. Because we're just grouping them together in different ways, so we changed the weights around, but as long as we changed the weights appropriately we should get the same average. That's the intuition. But for any of you who wanna see more of an algebraic justification for that, the way to think of it is as a double sum. So the double sum would be, I mean, to rewrite this: this says sum over all pebbles, right?
But one way to think of that would be first sum over values of little x. And then for each value of little x, sum over all pebbles s such that X(s) = x. Because this is just a sum of a bunch of numbers, we can sum them in any order we want. So I can rearrange them in this particular order, where I'm saying first sum over the little x values, and then group together and sum over all the pebbles that have that value. It's the exact same thing, I just reordered the terms. So that's g(X(s)) times P(s). Now let's just simplify this double sum. The reason I wanted to write it as a double sum like this is that within this inner summation X(s) = x, so this thing is just g(x). The cool thing is that g(x) does not depend on s, so that comes out. So we actually have the sum over x of g(x) times the sum of whatever is left, P(s), and that's summed over all s such that X(s) = x. And now we're done with the proof, because this sum here is just saying add up the masses of all the pebbles labeled x. In other words, that's what I called a super pebble. For the super pebble, the mass is the sum of all the masses of the little pebbles that form the super pebble. That's P of (this is just practice, going back to the very beginning: events, and what's a random variable) the event X = x. We talked about what it means for big X to equal little x, right? What does that equation mean? That's an event. That's this event that we have here. Okay, so that's why that's true. So that's why LOTUS is true. Anyway, that's all for now, and Friday we'll review; let me know if you have any suggestions for things to do on Friday.
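The grouped-versus-ungrouped argument above can be sketched numerically in Python. The pebbles, their masses, the random variable X, and the function g below are all made-up assumptions chosen for illustration (g is deliberately not one-to-one, so grouping by X differs from grouping by g(X)). The sketch computes E(g(X)) pebble by pebble, then again by first merging pebbles with the same X value into super pebbles.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical finite sample space: each pebble s has a mass P(s); masses sum to 1.
mass = {
    "a": Fraction(1, 10),
    "b": Fraction(2, 10),
    "c": Fraction(3, 10),
    "d": Fraction(4, 10),
}

# A random variable is a function on the sample space: pebble -> value X(s).
X = {"a": -1, "b": 1, "c": 1, "d": 2}


def g(x):
    """Any function of x; squaring is an arbitrary choice for this sketch."""
    return x * x


# Ungrouped: sum over pebbles of g(X(s)) * P(s).
ungrouped = sum(g(X[s]) * mass[s] for s in mass)

# Grouped: build the PMF of X by adding up the masses of all pebbles
# with the same X value (the super pebbles), then sum g(x) * P(X = x).
pmf = defaultdict(Fraction)  # Fraction() is 0
for s, m in mass.items():
    pmf[X[s]] += m
grouped = sum(g(x) * px for x, px in pmf.items())

print(ungrouped, grouped)
```

The two sums contain the same terms in a different order, so they agree; only the PMF of X is needed, never the distribution of g(X).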
Examples
Often, location–scale families are restricted to those where all members have the same functional form. Most location–scale families are univariate, though not all. Well-known families in which the functional form of the distribution is consistent throughout the family include the following:
 Normal distribution
 Elliptical distributions
 Cauchy distribution
 Uniform distribution (continuous)
 Uniform distribution (discrete)
 Logistic distribution
 Laplace distribution
 Student's t-distribution
 Generalized extreme value distribution
Converting a single distribution to a location–scale family
The following shows how to implement a location–scale family in a statistical package or programming environment where only functions for the "standard" version of a distribution are available. It is designed for R but should generalize to any language and library.
The example here is of the Student's t-distribution, which is normally provided in R only in its standard form, with a single degrees of freedom parameter df. The versions below with _ls appended show how to generalize this to a generalized Student's t-distribution with an arbitrary location parameter mu and scale parameter sigma.
Probability density function (PDF): dt_ls(x, df, mu, sigma) = 1/sigma * dt((x - mu)/sigma, df)

Cumulative distribution function (CDF): pt_ls(x, df, mu, sigma) = pt((x - mu)/sigma, df)

Quantile function (inverse CDF): qt_ls(prob, df, mu, sigma) = qt(prob, df)*sigma + mu

Generate n random variates: rt_ls(n, df, mu, sigma) = rt(n, df)*sigma + mu

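Note that the generalized functions do not have standard deviation sigma, since the standard t-distribution does not have a standard deviation of 1. For df > 2 the standard t-distribution has standard deviation sqrt(df/(df - 2)), so the generalized version has standard deviation sigma * sqrt(df/(df - 2)).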
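The same four-function recipe carries over to other languages. Below is a minimal Python sketch using only the standard library; because the standard library has no t-distribution, it uses the standard normal (statistics.NormalDist) as the "standard" member instead, which is a substitution made for this sketch. The names pdf_ls, cdf_ls, quantile_ls and sample_ls are hypothetical, chosen to mirror the _ls convention above.

```python
from statistics import NormalDist

# The "standard" member of the family: location 0, scale 1.
STD = NormalDist()


def pdf_ls(x, mu, sigma):
    """Density of mu + sigma * Z: rescale the argument and divide by sigma."""
    return STD.pdf((x - mu) / sigma) / sigma


def cdf_ls(x, mu, sigma):
    """CDF of mu + sigma * Z: only the argument is rescaled."""
    return STD.cdf((x - mu) / sigma)


def quantile_ls(prob, mu, sigma):
    """Quantile of mu + sigma * Z: transform the standard quantile."""
    return STD.inv_cdf(prob) * sigma + mu


def sample_ls(n, mu, sigma, seed=None):
    """Draw n variates by transforming standard draws."""
    return [z * sigma + mu for z in STD.samples(n, seed=seed)]


# Illustrative check against the library's own parametrized normal.
mu, sigma = 3.0, 2.0
reference = NormalDist(mu, sigma)
print(pdf_ls(4.0, mu, sigma), reference.pdf(4.0))
print(cdf_ls(4.0, mu, sigma), reference.cdf(4.0))
print(quantile_ls(0.9, mu, sigma), reference.inv_cdf(0.9))
```

For the normal family the standard member has standard deviation 1, so here sigma is also the standard deviation; as the note above explains, that does not hold for the t-distribution.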
References
 Meyer, Jack (1987). "Two-Moment Decision Models and Expected Utility Maximization". American Economic Review. 77 (3): 421–430. JSTOR 1804104.
 Mayshar, J. (1978). "A Note on Feldstein's Criticism of Mean-Variance Analysis". Review of Economic Studies. 45 (1): 197–199. JSTOR 2297094.
 Sinn, H.-W. (1983). Economic Decisions under Uncertainty (Second English ed.). North-Holland.