Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

21. Bayesian Statistical Inference I

Transcription
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: It involves real phenomena out there. So we have real stuff that happens. So it might be an arrival process to a bank that we're trying to model. This is a reality, but this is what we have been doing so far. We have been playing with models of probabilistic phenomena. And somehow we need to tie the two together. The way these are tied is that we observe the real world and this gives us data. And then based on these data, we try to come up with a model of what exactly is going on. For example, for an arrival process, you might ask the model in question, is my arrival process Poisson or is it something different? If it is Poisson, what is the rate of the arrival process? Once you come up with your model and you come up with the parameters of the model, then you can use it to make predictions about reality or to figure out certain hidden things, certain hidden aspects of reality, that you do not observe directly, but you try to infer what they are. So that's where the usefulness of the model comes in. Now this field is of course tremendously useful. And it shows up pretty much everywhere. So we talked about the polling examples in the last couple of lectures. This is, of course, a real application. You sample and on the basis of the sample that you have, you try to make some inferences about, let's say, the preferences in a given population. Let's say in the medical field, you want to try whether a certain drug makes a difference or not. So people would do medical trials, get some results, and then from the data somehow you need to make sense of them and make a decision. Is the new drug useful or is it not? How do we go systematically about the question of this type? 
A sexier, more recent topic: there's this famous Netflix competition where Netflix gives you a huge table of movies and people. People have rated the movies, but not everyone has watched all of the movies in there. You have some of the ratings. For example, this person gave a 4 to that particular movie. So you get a table that's partially filled. And Netflix asks you to make recommendations to people. So this means trying to guess. This person here, how much would they like this particular movie? And you can start thinking, well, maybe this person has given somewhat similar ratings to another person. And if that other person has also seen that movie, maybe the rating of that other person is relevant. But of course it's a lot more complicated than that. And this has been a serious competition where people have been using every piece of heavyweight machinery that there is in statistics, trying to come up with good recommendation systems. Then other people, of course, are trying to analyze financial data. Somebody gives you the sequence of the values, let's say of the S&P index. You look at something like this and you can ask questions. How do I model these data using any of the models that we have in our bag of tools? How can I make predictions about what's going to happen afterwards, and so on? On the engineering side, anywhere where you have noise, inference comes in. Signal processing, in some sense, is just an inference problem. You observe signals that are noisy and you try to figure out exactly what's happening out there or what kind of signal has been sent. Maybe the beginning of the field could be traced back a few hundred years, to when people would make astronomical observations of the position of the planets in the sky. They would have some belief that perhaps the orbit of a planet is an ellipse. Or if it's a comet, maybe it's a parabola, a hyperbola, they don't know what it is. But they would have a model of that. 
But, of course, astronomical measurements would not be perfectly exact. And they would try to find the curve that fits these data. How do you go about choosing this particular curve on the basis of noisy data, and try to do it in a somewhat principled way? OK, so for questions of this type, clearly the applications are all over the place. But how is this related conceptually to what we have been doing so far? What's the relation between the field of inference and the field of probability as we have been practicing it until now? Well, mathematically speaking, what's going to happen in the next few lectures could be just exercises or homework problems in the class, based on what we have done so far. That means you're not going to get any new facts about probability theory. Everything we're going to do will be simple applications of things that you already know. So in some sense, statistics and inference is just an applied exercise in probability. But actually, things are not that simple, in the following sense. If you get a probability problem, there's a correct answer. There's a correct solution. And that correct solution is unique. There's no ambiguity. The theory of probability has clearly defined rules. These are the axioms. You're given some information about probability distributions. You're asked to calculate certain other things. There's no ambiguity. Answers are always unique. In statistical questions, it's no longer the case that the question has a unique answer. If I give you data and I ask you what's the best way of estimating the motion of that planet, reasonable people can come up with different methods. One reasonable person will argue, my method has these desirable properties, but somebody else may say, here's another method that has certain other desirable properties. And it's not clear what the best method is. 
So it's good to have some understanding of what the issues are and to know at least what the general class of methods is that one tries to consider, how one goes about such problems. So we're going to see lots and lots of different inference methods. We're not going to tell you that one is better than the other. But it's important to understand what the concepts behind those different methods are. And finally, statistics can be misused really badly. That is, one can come up with methods that you think are sound, but in fact they're not quite that. I will bring some examples next time and talk a little more about this. So, let's say you have some data and you want to make some inference from them. What many people will do is go to Wikipedia, find a statistical test that they think applies to that situation, plug in numbers, and present results. Are the conclusions that they get really justified, or are they misusing statistical methods? Well, too many people actually do misuse statistics, and the conclusions that people get are often false. So it's important, besides just being able to copy statistical tests and use them, to understand what the assumptions behind the different methods are and what kind of guarantees they have, if any. All right, so we'll try to do a quick tour through the field of inference in this lecture and the next few lectures that we have left this semester, and try to highlight at a very high level the main concepts, skills, and techniques that come in. Let's start with some generalities and some general statements. One first statement is that statistics or inference problems come up in very different guises. And they may look as if they are of very different forms, although at some fundamental level the basic issues turn out to be always pretty much the same. So let's look at this example. There's an unknown signal that's being sent. 
It's sent through some medium, and that medium just takes the signal and amplifies it by a certain number. So you can think of somebody shouting. There's the air out there. What you shouted will be attenuated through the air until it gets to a receiver. And that receiver then observes this, but together with some random noise. Here I meant S. S is the signal that's being sent. And what you observe is an X. You observe X, so what kind of inference problems could we have here? In some cases, you want to build a model of the physical phenomenon that you're dealing with. So for example, you don't know the attenuation of your signal and you try to find out what this number is based on the observations that you have. So the way this is done in engineering systems is that you design a certain signal, you know what it is, you shout a particular word, and then the receiver listens. And based on the intensity of the signal that they get, they try to make a guess about A. So you don't know A, but you know S. And by observing X, you get some information about what A is. So in this case, you're trying to build a model of the medium through which your signal is propagating. So sometimes one would call problems of this kind, let's say, system identification. In a different version of an inference problem that comes with this picture, you've done your modeling. You know your A. You know the medium through which the signal is going, but it's a communication system. This person is trying to communicate something to that person. So you send the signal S, but that person receives a noisy version of S. So that person tries to reconstruct S based on X. So in both cases, we have a linear relation between X and the unknown quantity. In one version, A is the unknown and we know S. In the other version, A is known, and so we try to infer S. Mathematically, you can see that this is essentially the same kind of problem in both cases. 
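The linear relation described here can be written out explicitly. This is a sketch reconstructed from the spoken description, not a copy of the slide; the noise symbol W is an assumed name, since the transcript only says "some random noise".

```latex
% Assumed notation: W is the additive random noise (name not given in the transcript).
X = A\,S + W
% System identification: S is known, A is unknown; infer A from X.
% Communication:         A is known, S is unknown; infer S from X.
```

In both readings X depends linearly on the unknown, which is why the two problems are treated with the same tools.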
Although, the kind of practical problem that you're trying to solve is a little different. So we will not be making any distinctions between problems of the model building type as opposed to problems where you try to estimate some unknown signal and so on. Because conceptually, the tools that one uses for both types of problems are essentially the same. OK, next, a very useful classification of inference problems. The unknown quantity that you're trying to estimate could be a discrete one that takes a small number of values. So these could be discrete problems, such as the airplane radar problem we encountered back a long time ago in this class. So there are two possibilities: an airplane is out there or an airplane is not out there. And you're trying to make a decision between these two options. Or you can have other problems where you have, let's say, four possible options. You don't know which one is true, but you get data and you try to figure out which one is true. In problems of this kind, usually you want to make a decision based on your data. And you're interested in the probability of making a correct decision. You would like that probability to be as high as possible. Estimation problems are a little different. Here you have some continuous quantity that's not known. And you try to make a good guess of that quantity. And you would like your guess to be as close as possible to the true quantity. So the polling problem was of this type. There was an unknown fraction f of the population that had some property. And you try to estimate f as accurately as you can. So the distinction is that in hypothesis testing the unknown quantity takes on a discrete set of values, while in estimation the unknown quantity takes a continuous set of values. In hypothesis testing we're interested in the probability of error. In estimation we're interested in the size of the error. Broadly speaking, most inference problems fall either in this category or in that category. 
Although, if you want to complicate life, you can also think of or construct problems where both of these aspects are simultaneously present. OK, finally, since we're in classification mode, there is a very big, important dichotomy in how one goes about inference problems. And here there are two fundamentally different philosophical points of view on how we model the quantity that is unknown. In one approach, you say there's a certain quantity that has a definite value. It just happens that I don't know it. But it's a number. There's nothing random about it. So think of trying to estimate some physical quantity. You're making measurements, you try to estimate the mass of an electron, which is a sort of universal physical constant. There's nothing random about it. It's a fixed number. You get data, because you have some measuring apparatus. And the results that you get out of that measuring apparatus are affected by the true mass of the electron, but there's also some noise. You take the data out of your measuring apparatus and you try to come up with some estimate of that quantity theta. So this is definitely a legitimate picture, but the important thing in this picture is that this theta is written as lowercase. And that's to make the point that it's a real number, not a random variable. There's a different philosophical approach which says, well, anything that I don't know I should model as a random variable. Yes, I know. The mass of the electron is not really random. It's a constant. But I don't know what it is. I have some vague sense, perhaps, of what it is, perhaps because of the experiments that some other people carried out. So perhaps I have a prior distribution on the possible values of Theta. And that prior distribution doesn't mean that nature is random; it's more of a description of my subjective beliefs about where I think this constant number happens to be. 
So even though it's not truly random, I model my initial beliefs, before the experiment starts, in terms of a prior distribution. I view it as a random variable. Then I observe another related random variable through some measuring apparatus. And then I use this again to create an estimate. So these two pictures philosophically are very different from each other. In the classical picture we treat the unknown quantities as unknown numbers. In the Bayesian picture we treat them as random variables. When we treat them as random variables, then we know pretty much already what we should be doing. We should just use the Bayes rule. Based on X, find the conditional distribution of Theta. And that's what we will be doing mostly over this lecture and the next lecture. Now in both cases, what you end up getting at the end is an estimate. But actually, what kind of object is that estimate? It's a random variable in both cases. Why? Even in the case where theta was a constant, my data are random. I do my data processing. So I calculate a function of the data, and the data are random variables. So out here we output something which is a function of a random variable. So this quantity here will also be random. It's affected by the noise in the experiment that I have been doing. That's why these estimators will be denoted by uppercase Thetas. And we will be using hats. A hat, usually in estimation, means an estimate of something. All right, so this is the big picture. We're going to start with the Bayesian version. And then in the last few lectures we're going to talk about the non-Bayesian version, or the classical one. By the way, I should say that statisticians have been debating fiercely for 100 years whether the right way to approach statistics is to go the classical way or the Bayesian way. And there have been tides going back and forth between the two sides. These days, Bayesian methods tend to become a little more popular for various reasons. We're going to come back to this later. 
All right, so in Bayesian estimation, what we've got in our hands is the Bayes rule. And if you have the Bayes rule, there's not a lot that's left to do. We have different forms of the Bayes rule, depending on whether we're dealing with discrete data and discrete quantities to estimate, or continuous data, and so on. In the hypothesis testing problem, the unknown quantity Theta is discrete. So in both cases here, we have a P of Theta. We obtain data, the X's. And on the basis of the X that we observe, we can calculate the posterior distribution of Theta, given the data. So to use Bayesian inference, what do we start with? We start with some priors. These are our initial beliefs about what Theta might be. That's before we do the experiment. We have a model of the experimental apparatus. And the model of the experimental apparatus tells us, if this Theta is true, I'm going to see X's of that kind. If that other Theta is true, I'm going to see X's that are somewhere else. That models my apparatus. And based on that knowledge, once I have these two functions in my hands, we have already seen that if you know those two functions, you can also calculate the denominator here. So all of these functions are available, so you can compute, you can find a formula for this function as well. And as soon as you observe the data, the X's, you plug in here the numerical value of those X's. And you get a function of Theta. And this is the posterior distribution of Theta, given the data that you have seen. So you've already done a fair number of exercises of this kind. So we will not say more about this. And there's a similar formula, as you know, for the case where we have continuous data. If the X's are continuous random variables, then the formula is the same, except that the X's are described by densities instead of being described by probability mass functions. OK, now if Theta is continuous, then we're dealing with estimation problems. But the story is once more the same. 
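The formulas being pointed at on the slide are not in the transcript. As a sketch, here is the standard Bayes rule for discrete Theta, written in the usual subscript notation (an assumed notation, since the slide itself is not available):

```latex
% Discrete Theta, discrete X: prior times model of the apparatus, over the denominator.
p_{\Theta \mid X}(\theta \mid x)
  = \frac{p_{\Theta}(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{p_X(x)},
\qquad
p_X(x) = \sum_{\theta'} p_{\Theta}(\theta')\, p_{X \mid \Theta}(x \mid \theta')

% Discrete Theta, continuous X: the same formula with densities for X.
p_{\Theta \mid X}(\theta \mid x)
  = \frac{p_{\Theta}(\theta)\, f_{X \mid \Theta}(x \mid \theta)}{f_X(x)},
\qquad
f_X(x) = \sum_{\theta'} p_{\Theta}(\theta')\, f_{X \mid \Theta}(x \mid \theta')
```

The "two functions" are the prior and the conditional distribution of X given Theta; the denominator follows from them by the total probability theorem. When Theta is continuous, the prior and posterior become densities and the sum becomes an integral.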
You're going to use the Bayes rule to come up with the posterior density of Theta, given the data that you have observed. Now just for the sake of the example, let's come back to this picture here. Suppose that something is flying in the air, and maybe this is just an object in the air close to the Earth. So because of gravity, the trajectory that it's going to follow is going to be a parabola. So this is the general equation of a parabola. Zt is the position of my object at time t. But I don't know exactly which parabola it is. So the parameters of the parabola are unknown quantities. What I can do is to go and measure the position of my object at different times. But unfortunately, my measurements are noisy. What I want to do is to model the motion of my object. So I guess in the picture, the axes would be t going this way and Z going this way. And on the basis of the data that I get, these are my X's, I want to figure out the Thetas. That is, I want to figure out the exact equation of this parabola. Now if somebody gives you probability distributions for Theta, these would be your priors. So this is given. We need the conditional distribution of the X's given the Thetas. Well, we have the conditional distribution of Z, given the Thetas, from this equation. And then by playing with this equation, you can also find how X is distributed if Theta takes a particular value. So you do have all of the densities that you might need. And you can apply the Bayes rule. And at the end, your end result would be a formula for the distribution of Theta, given the X that you have observed. Except for one sort of complication, or to make things more interesting: instead of these X's and Thetas being single random variables as we have here, typically those X's and Thetas will be multidimensional random variables or will correspond to multiple ones. So this little Theta here actually stands for a triplet of Theta0, Theta1, and Theta2. 
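The parabola model can be sketched as follows. The transcript names Zt, the X's, and Theta0, Theta1, Theta2; the additive noise term W_t and the number of measurements n are assumptions introduced here for illustration.

```latex
% Trajectory: a parabola in t with unknown coefficients.
Z_t = \Theta_0 + \Theta_1 t + \Theta_2 t^2

% Noisy measurement at time t (W_t is an assumed additive noise term).
X_t = Z_t + W_t

% Bayes rule for the vector case, with
% \theta = (\theta_0, \theta_1, \theta_2) and x = (x_1, \dots, x_n).
f_{\Theta \mid X}(\theta \mid x)
  = \frac{f_{\Theta}(\theta)\, f_{X \mid \Theta}(x \mid \theta)}{f_X(x)}
```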
And that X here stands for the entire sequence of X's that we have observed. So in reality, the object that you're going to get at the end, after inference is done, is a function into which you plug the values of the data, and you get a function of the Thetas that tells you the relative likelihoods of different Theta triplets. So what I'm saying is that this is no harder than the problems that you have dealt with so far, except perhaps for the complication that usually, in interesting inference problems, your Thetas and X's are vectors of random variables instead of individual random variables. Now if you are to do estimation in a case where you have discrete data, again the situation is no different. We still have a Bayes rule of the same kind, except that densities get replaced by PMFs. If X is discrete, you put a P here instead of putting an f. So an example of an estimation problem with discrete data is similar to the polling problem. You have a coin. It has an unknown parameter Theta. This is the probability of obtaining heads. You flip the coin many times. What can you tell me about the true value of Theta? A classical statistician, at this point, would say, OK, I'm going to use an estimator, the most reasonable one, which is this. How many heads did I obtain in n trials? Divide by the total number of trials. This is my estimate of the bias of my coin. And then the classical statistician would continue from here and try to prove some properties and argue that this estimate is a good one. For example, we have the weak law of large numbers that tells us that this particular estimate converges in probability to the true parameter. This is a kind of guarantee that's useful to have. And the classical statistician would pretty much close the subject in this way. What would the Bayesian person do differently? The Bayesian person would start by assuming a prior distribution of Theta. 
Instead of treating Theta as an unknown constant, they would say that Theta was picked randomly, or pretend that it was picked randomly, and assume a distribution on Theta. So for example, if you don't know anything more, you might assume that any value for the bias of the coin is as likely as any other value of the bias of the coin. And this way you get a probability distribution that's uniform. Or if you have a little more faith in the manufacturing process that created that coin, you might choose your prior to be a distribution that's centered around 1/2 and sits fairly narrowly around 1/2. That would be a prior distribution in which you say, well, I believe that the manufacturer tried to make my coin fair. But they often make some mistakes, so I believe it's approximately 1/2 but not quite. So depending on your beliefs, you would choose an appropriate prior for the distribution of Theta. And then you would use the Bayes rule to find the probabilities of different values of Theta, based on the data that you have observed. So no matter which version of the Bayes rule you use, the end product of the Bayes rule is going to be either a plot of this kind or a plot of that kind. So what am I plotting here? This axis is the Theta axis. These are the possible values of the unknown quantity that we're trying to estimate. In the continuous case, Theta is a continuous random variable. I obtain my data. And I plot the posterior probability distribution after observing my data. And I'm plotting here the probability density for Theta. So this is a plot of that density. In the discrete case, Theta can take finitely many values or a discrete set of values. And for each one of those values, I'm telling you how likely that value is to be the correct one, given the data that I have observed. 
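The coin example can be made concrete with a small numerical sketch. Everything specific below is an assumption for illustration: the 101-point grid of candidate biases, the observed data (7 heads in 10 flips), and the width 0.05 of the prior centered at 1/2. The code applies the Bayes rule on the grid for both priors mentioned above.

```python
from math import comb, exp


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def coin_posterior(heads, flips, prior, grid):
    """Posterior over candidate biases in `grid`, given `heads` in `flips` tosses."""
    # Model of the experiment: binomial probability of the data for each theta.
    likelihood = [
        comb(flips, heads) * t**heads * (1 - t) ** (flips - heads) for t in grid
    ]
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(unnormalized)  # denominator of the Bayes rule
    return [u / evidence for u in unnormalized]


def posterior_mode(posterior, grid):
    """Grid value with the largest posterior weight."""
    return grid[max(range(len(grid)), key=posterior.__getitem__)]


# Hypothetical setup: candidate biases 0.00, 0.01, ..., 1.00.
grid = [i / 100 for i in range(101)]
uniform_prior = normalize([1.0] * len(grid))
# Hypothetical "trust the manufacturer" prior: bell shape around 1/2, width 0.05.
peaked_prior = normalize([exp(-((t - 0.5) ** 2) / (2 * 0.05**2)) for t in grid])

heads, flips = 7, 10  # hypothetical data
post_uniform = coin_posterior(heads, flips, uniform_prior, grid)
post_peaked = coin_posterior(heads, flips, peaked_prior, grid)

print("mode under uniform prior:", posterior_mode(post_uniform, grid))
print("mode under peaked prior: ", posterior_mode(post_peaked, grid))
```

With the uniform prior the posterior peaks at the observed fraction of heads; with the narrow prior the peak is pulled back toward 1/2, which is the sense in which the choice of prior matters when data are limited.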
And in general, what you would go back to your boss and report after you've done all your inference work would be either a plot of this kind or of that kind. So you go to your boss, who asks you, what is the value of Theta? And you say, well, I only have limited data. So I don't know what it is. It could be this, with so much probability. OK, let's throw in some numbers here. There's probability 0.3 that Theta is this value. There's probability 0.2 that Theta is this value, 0.1 that it's this one, 0.1 that it's this one, 0.2 that it's that one, and so on. OK, now bosses often want simple answers. They say, OK, you're talking too much. What do you think Theta is? And now you're forced to make a decision. If that was the situation and you had to make a decision, how would you make it? Well, I'm going to make a decision that's most likely to be correct. If I make this decision, what's going to happen? Theta is this value with probability 0.2, which means there's probability 0.8 that I make an error if I make that guess. If I make that decision, this decision has probability 0.3 of being the correct one. So I have probability of error 0.7. So if you want to just maximize the probability of giving the correct decision, or if you want to minimize the probability of making an incorrect decision, what you're going to choose to report is that value of Theta for which the probability is highest. So in this case, I would choose to report this particular value, the most likely value of Theta, given what I have observed. And that value is called the maximum a posteriori probability estimate. It's going to be this one in our case. So you pick the point in the posterior PMF that has the highest probability. That's the reasonable thing to do. This is the optimal thing to do if you want to minimize the probability of an incorrect inference. And that's what people usually do if they need to report a single answer, if they need to report a single decision. 
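The decision rule just described can be sketched in a few lines. The labels "a" to "f" are hypothetical stand-ins for the values pointed at on the slide, and the last probability (0.1) is an assumption added so that the quoted numbers sum to 1.

```python
# Hypothetical posterior PMF: the first five probabilities are the ones quoted
# in the lecture; the value for "f" is assumed so that the total is 1.
posterior = {"a": 0.3, "b": 0.2, "c": 0.1, "d": 0.1, "e": 0.2, "f": 0.1}


def map_estimate(pmf):
    """Value with the highest posterior probability (the MAP estimate)."""
    return max(pmf, key=pmf.get)


def error_probability(pmf, guess):
    """Probability that reporting `guess` is wrong."""
    return 1.0 - pmf[guess]


best = map_estimate(posterior)
print("MAP estimate:", best)
print("error probability if we report it:", error_probability(posterior, best))
print("error probability if we report 'b':", error_probability(posterior, "b"))
```

Any other choice has a smaller probability of being correct, so the MAP value is the one that minimizes the probability of error.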
How about in the estimation context? If that's what you know about Theta, Theta could be around here, but there's also some probability that it is around here. What's the single answer that you would give to your boss? One option is to use the same philosophy and say, OK, I'm going to find the Theta at which this posterior density is highest. So I would pick this point here and report this particular Theta. So this would be my Theta, again, Theta MAP, the Theta that has the highest a posteriori probability, just because it corresponds to the peak of the density. But in the discrete context, the maximum a posteriori probability Theta was the one that was most likely to be true. In the continuous case, you cannot really say that this is the most likely value of Theta. In a continuous setting, any single value of Theta has zero probability, so we talk about densities. So it's not the most likely. It's the one for which the density, so the probability of a small neighborhood, is highest. So the rationale for picking this particular estimate in the continuous case is much less compelling than the rationale that we had in the discrete case. So in this case, reasonable people might choose different quantities to report. And a very popular one would be to report instead the conditional expectation. So I don't know quite what Theta is. Given the data that I have, Theta has this distribution. Let me just report the average over that distribution. Let me report the center of gravity of this figure. And in this figure, the center of gravity would probably be somewhere around here. And that would be a different estimate that you might choose to report. So the center of gravity is somewhere around here. And this is the conditional expectation of Theta, given the data that you have. So these are two, in some sense, fairly reasonable ways of choosing what to report to your boss. Some people might choose to report this. Some people might choose to report that. 
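To see how far apart the two summaries can be, here is a sketch with a made-up two-humped posterior density. The shape (a larger hump near 3 and a smaller one near 7, on the range 0 to 10) is purely an assumption for illustration; the lecture's actual figure is not available.

```python
from math import exp


def bell(x, center, width):
    """Unnormalized bell-shaped bump, scaled so its area is proportional to 1."""
    return exp(-0.5 * ((x - center) / width) ** 2) / width


# Hypothetical posterior on a grid over [0, 10].
step = 0.01
thetas = [i * step for i in range(1001)]
raw = [0.7 * bell(t, 3.0, 0.5) + 0.3 * bell(t, 7.0, 0.5) for t in thetas]

# Normalize so that the density integrates to 1 (Riemann sum on the grid).
area = sum(raw) * step
density = [r / area for r in raw]

# MAP estimate: location of the peak of the posterior density.
theta_map = thetas[max(range(len(thetas)), key=density.__getitem__)]

# Conditional expectation: center of gravity of the density.
theta_mean = sum(t * d for t, d in zip(thetas, density)) * step

print("MAP estimate:           ", round(theta_map, 2))
print("conditional expectation:", round(theta_mean, 2))
```

The peak sits on the larger hump, while the center of gravity lands between the two humps, in a region where the density itself is low. Neither number alone conveys the two-humped shape.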
And a priori, there's no compelling reason why one would be preferable to the other, unless you set some rules for the game and you describe a little more precisely what your objectives are. But no matter which one you report, a single answer, a point estimate, doesn't really tell you the whole story. There's a lot more information conveyed by this posterior distribution plot than by any single number that you might report. So in general, you may wish to convince your boss that it's worth their time to look at the entire plot, because that plot sort of covers all the possibilities. It tells your boss, most likely we're in that range, but there's also a distinct chance that our Theta happens to lie in that range. All right, now let us try to perhaps differentiate between these two and see under what circumstances this one might be the better estimate to report. Better with respect to what? We need some rules. So we're going to throw in some rules. As a warm-up, we're going to deal with the problem of making an estimate if you had no information at all, except for a prior distribution. So this is a warm-up for what's coming next, which would be estimation that takes into account some information. So we have a Theta. And because of your subjective beliefs or models by others, you believe that Theta is uniformly distributed between, let's say, 4 and 10. You want to come up with a point estimate. Let's try to look for an estimate. Call it c, in this case. I want to pick a number with which to estimate the value of Theta. I will be interested in the size of the error that I make. And I really dislike large errors, so I'm going to focus on the square of the error that I make. So I pick c. Theta has a random value that I don't know. But whatever it is, once it becomes known, it results in a squared error between what it is and what I guessed that it was. 
And I'm interested in making a small error on the average, where the average is taken with respect to all the possible and unknown values of Theta. So this is a least squares formulation of the problem, where we try to minimize the mean squared error. How do you find the optimal c? Well, we take that expression and expand it. And it is, using linearity of expectations, the expected value of Theta squared, minus 2c times the expected value of Theta, plus c squared. That's the quantity that we want to minimize with respect to c. To do the minimization, take the derivative with respect to c and set it to 0. So that differentiation gives us, from here, minus 2 times the expected value of Theta plus 2c is equal to 0. And the answer that you get by solving this equation is that c is the expected value of Theta. So when you do this optimization, you find that the optimal estimate, the thing you should be reporting, is the expected value of Theta. So in this particular example, you would choose your estimate c to be just the middle of these values, which would be 7. OK, and in case your boss asks you, how good is your estimate? How big is your error going to be? What you could report is the average size of the estimation error that you are making. We picked our estimate to be the expected value of Theta. So for this particular way that I'm choosing to do my estimation, this is the mean squared error that I get. And this is a familiar quantity. It's just the variance of the distribution. So the expectation is the best way to estimate a quantity, if you're interested in the mean squared error. And the resulting mean squared error is the variance itself. How will this story change if we now have data as well? Now having data means that we can compute posterior distributions or conditional distributions. 
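Written out, the minimization described above goes as follows; the last line plugs in the uniform prior on [4, 10] from the example, using the standard mean and variance of a uniform distribution.

```latex
% Expand the mean squared error using linearity of expectation.
\mathbf{E}\big[(\Theta - c)^2\big]
  = \mathbf{E}[\Theta^2] - 2c\,\mathbf{E}[\Theta] + c^2

% Differentiate with respect to c and set the derivative to zero.
\frac{d}{dc}\,\mathbf{E}\big[(\Theta - c)^2\big]
  = -2\,\mathbf{E}[\Theta] + 2c = 0
  \quad\Longrightarrow\quad c = \mathbf{E}[\Theta]

% Resulting mean squared error is the variance.
\mathbf{E}\big[(\Theta - \mathbf{E}[\Theta])^2\big] = \operatorname{var}(\Theta)

% For \Theta uniform on [4, 10]:
\mathbf{E}[\Theta] = \frac{4 + 10}{2} = 7,
\qquad
\operatorname{var}(\Theta) = \frac{(10 - 4)^2}{12} = 3
```

The second derivative with respect to c is 2, which is positive, so the stationary point is indeed a minimum.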
So we get transported into a new universe where, instead of working with the original distribution of Theta, the prior distribution, we now work with the conditional distribution of Theta, given the data that we have observed. Now remember our old slogan that conditional models and conditional probabilities are no different from ordinary probabilities, except that we now live in a new universe where the new information has been taken into account. So if you use that philosophy and you're asked to minimize the squared error, but now you live in a new universe where X has been fixed to something, what would the optimal solution be? It would again be the expectation of Theta, but which expectation? It's the expectation which applies in the new conditional universe in which we live right now. So because of what we did before, by the same calculation, we would find that the optimal estimate is the expected value of Theta given X, the optimal estimate that takes into account the information that we have. So the conclusion: once you get your data, if you want to minimize the mean squared error, you should just report the conditional expectation of this unknown quantity based on the data that you have. So the picture here is that Theta is unknown. You have your apparatus that creates measurements. So this creates an X. You take an X, and here you have a box that does calculations. It does calculations and it spits out the conditional expectation of Theta, given the particular data that you have observed. And what we have done in this class so far is, to some extent, developing the computational tools and skills to do this particular calculation: how to calculate the posterior density for Theta and how to calculate expectations, conditional expectations. So in principle, we know how to do this. In principle, we can program a computer to take the data and to spit out conditional expectations. 
Somebody who doesn't think like us might instead design a calculating machine that does something differently and produces some other estimate. So we went through this argument and we decided to program our computer to calculate conditional expectations. Somebody else came up with some other crazy idea for how to estimate the random variable. They came up with some function g and they programmed it, and they designed a machine that estimates Theta by outputting a certain g of X. That could be an alternative estimator. Which one is better? Well, we convinced ourselves that this is the optimal one in a universe where we have fixed the particular value of the data. So what we have proved so far is a relation of this kind. In this conditional universe, the mean squared error that I get (I'm the one who's using this estimator) is less than or equal to the mean squared error that this person will get, the person who uses that estimator. For any particular value of the data, I'm going to do better than the other person. Now the data themselves are random. If I average over all possible values of the data, I should still be better off. If I'm better off for any possible value X, then I should be better off on the average over all possible values of X. So let us average both sides of this quantity with respect to the probability distribution of X. If you want to do it formally, you can write this inequality between numbers as an inequality between random variables. And it tells us that no matter what that random variable turns out to be, this quantity is better than that quantity. Take expectations of both sides, and you get this inequality between expectations overall. And this last inequality tells me that the person who's using this estimator, who produces estimates according to this machine, will have a mean squared estimation error that's less than or equal to the estimation error that's produced by the other person. 
In a few words, the conditional expectation estimator is the optimal estimator. It's the ultimate estimating machine. That's how you should solve estimation problems and report a single value, if you're forced to report a single value and if you're interested in estimation errors. OK, while we could have told you that story, of course, a month or two ago, this is really about interpretation, about realizing that conditional expectations have a very nice property. But other than that, any probabilistic skills that come into this business are just the probabilistic skills of being able to calculate conditional expectations, which you already know how to do. So conclusion, all of optimal Bayesian estimation just means calculating and reporting conditional expectations. Well, if the world were that simple, then statisticians wouldn't be able to find jobs. So real life is not that simple. There are complications. And that perhaps makes their life a little more interesting. OK, one complication is that we would deal with vectors instead of just single random variables. I use the notation here as if X was a single random variable. In real life, you get several data. Does our story change? Not really, same argument: given all the data that you have observed, you should still report the conditional expectation of Theta. But what kind of work does it take in order to report this conditional expectation? One issue is that you need to cook up a plausible prior distribution for Theta. How do you do that? In a given application, this is a bit of a judgment call, what prior would you be working with. And there's a certain skill there of not making silly choices. A more pragmatic, practical issue is that this is a formula that's extremely nice and compact and simple that you can write with minimal ink. But behind it there could be hidden a huge amount of calculation. 
So doing any sort of calculations that involve multiple random variables really involves calculating multidimensional integrals. And multidimensional integrals are hard to compute. So actually implementing this calculating machine here may not be easy, might be complicated computationally. It's also complicated in terms of not being able to derive intuition about it. So perhaps you might want to have a simpler version, a simpler alternative to this formula that's easier to work with and easier to calculate. We will be talking about one such simpler alternative next time. So again, to conclude, at the high level, Bayesian estimation is very, very simple, given that you have mastered everything that has happened in this course so far. There are certain practical issues and it's also good to be familiar with the concepts and the issues: in general, you would prefer to report the complete posterior distribution. But if you're forced to report a point estimate, then there's a number of reasonable ways to do it. And perhaps the most reasonable one is to just report the conditional expectation itself.
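The least squares argument from the lecture can be checked numerically. The sketch below is only an illustration: it assumes a hypothetical prior in which Theta is equally likely to be any integer from 4 to 10 (chosen so that the mean is 7), and the names `values`, `mean_squared_error` and `candidates` are not from the lecture.

```python
from fractions import Fraction

# Hypothetical prior: Theta is equally likely to be any integer from 4 to 10.
# These values are an assumption, chosen so that the prior mean is 7.
values = list(range(4, 11))
prob = Fraction(1, len(values))

def mean_squared_error(c):
    """E[(Theta - c)^2] under the prior above."""
    return sum(prob * (theta - c) ** 2 for theta in values)

prior_mean = sum(prob * theta for theta in values)
prior_variance = mean_squared_error(prior_mean)

# Scan candidate estimates c between 4 and 10 in steps of 1/10 and keep the best.
candidates = [Fraction(k, 10) for k in range(40, 101)]
best_c = min(candidates, key=mean_squared_error)

print(prior_mean, best_c, prior_variance)
```

The best candidate coincides with the mean, and the error it achieves is the variance, as the derivation says it should be.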
Contents
 1 Introduction to Bayes' rule
 2 Formal description of Bayesian inference
 3 Inference over exclusive and exhaustive possibilities
 4 Mathematical properties
 5 Examples
 6 In frequentist statistics and decision theory
 7 Applications
 8 Bayes and Bayesian inference
 9 History
 10 See also
 11 Notes
 12 References
 13 Further reading
 14 External links
Introduction to Bayes' rule
Formal
Bayesian inference derives the posterior probability as a consequence of two antecedents, a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem:
$$P(H \mid E) = \frac{P(E \mid H) \cdot P(H)}{P(E)}$$
where
 $\mid$ means "event conditional on" (so that $(A \mid B)$ means A given B).
 $H$ stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
 $E$, the evidence, corresponds to new data that were not used in computing the prior probability.
 $P(H)$, the prior probability, is the estimate of the probability of the hypothesis $H$ before the data $E$, the current evidence, is observed.
 $P(H \mid E)$, the posterior probability, is the probability of $H$ given $E$, i.e., after $E$ is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
 $P(E \mid H)$ is the probability of observing $E$ given $H$. As a function of $E$ with $H$ fixed, this is the likelihood: it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, $E$, while the posterior probability is a function of the hypothesis, $H$.
 $P(E)$ is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis $H$ does not appear anywhere in the symbol, unlike for all the other factors), so this factor does not enter into determining the relative probabilities of different hypotheses.
For different values of $H$, only the factors $P(H)$ and $P(E \mid H)$, both in the numerator, affect the value of $P(H \mid E)$: the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).
Bayes' rule can also be written as follows:
$$P(H \mid E) = \frac{P(E \mid H)}{P(E)} \cdot P(H)$$
where the factor $\frac{P(E \mid H)}{P(E)}$ can be interpreted as the impact of $E$ on the probability of $H$.
Informal
If the evidence does not match up with a hypothesis, one should reject the hypothesis. But if a hypothesis is extremely unlikely a priori, one should also reject it, even if the evidence does appear to match up. For example, if one does not know whether the newborn baby next door is a boy or a girl, the color of decorations on the crib in front of the door may support the hypothesis of one gender or the other; but if behind that door, instead of the crib, a dog kennel is found, the posterior probability that the family next door gave birth to a dog remains small in spite of the "evidence", since one's prior belief in such a hypothesis was already extremely small.
The critical point about Bayesian inference, then, is that it provides a principled way of combining new evidence with prior beliefs, through the application of Bayes' rule. (Contrast this with frequentist inference, which relies only on the evidence as a whole, with no reference to prior beliefs.)
Furthermore, Bayes' rule can be applied iteratively: after observing some evidence, the resulting posterior probability can then be treated as a prior probability, and a new posterior probability computed from new evidence. This allows for Bayesian principles to be applied to various kinds of evidence, whether viewed all at once or over time. This procedure is termed "Bayesian updating".
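Iterative updating can be sketched in a few lines. The example is a minimal illustration under stated assumptions: the two hypotheses (a fair coin and a coin biased towards heads), their head probabilities and the toss sequence are all hypothetical, and `update` is an illustrative helper rather than a standard function.

```python
def update(prior, likelihoods):
    """One Bayes update over a finite set of hypotheses.

    prior       -- dict mapping hypothesis name to P(H)
    likelihoods -- dict mapping hypothesis name to P(E | H)
    """
    evidence = sum(prior[h] * likelihoods[h] for h in prior)  # P(E)
    return {h: prior[h] * likelihoods[h] / evidence for h in prior}

# Hypothetical setup: a coin is either fair or biased towards heads.
p_heads = {"fair": 0.5, "biased": 0.8}
belief = {"fair": 0.5, "biased": 0.5}

# Sequential updating: the posterior after each toss becomes the next prior.
for toss in "HHTH":
    likelihoods = {h: p if toss == "H" else 1 - p for h, p in p_heads.items()}
    belief = update(belief, likelihoods)

# Batch updating: use the likelihood of the whole sequence (3 heads, 1 tail) at once.
batch_likelihoods = {h: p ** 3 * (1 - p) for h, p in p_heads.items()}
batch = update({"fair": 0.5, "biased": 0.5}, batch_likelihoods)

print(belief)
print(batch)
```

Up to floating-point rounding, both routes give the same posterior, which is why the evidence can be viewed all at once or over time.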
Alternatives to Bayesian updating
Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.
Ian Hacking noted that traditional "Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote^{[1]} "And neither the Dutch book argument, nor any other in the personalist arsenal of proofs of the probability axioms, entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."
Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "probability kinematics") following the publication of Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.^{[2]} The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.^{[3]}
Formal description of Bayesian inference
Definitions
 $x$, a data point in general. This may in fact be a vector of values.
 $\theta$, the parameter of the data point's distribution, i.e., $x \sim p(x \mid \theta)$. This may in fact be a vector of parameters.
 $\alpha$, the hyperparameter of the parameter distribution, i.e., $\theta \sim p(\theta \mid \alpha)$. This may in fact be a vector of hyperparameters.
 $\mathbf{X}$ is the sample, a set of $n$ observed data points, i.e., $x_1, \ldots, x_n$.
 $\tilde{x}$, a new data point whose distribution is to be predicted.
Bayesian inference
 The prior distribution is the distribution of the parameter(s) before any data is observed, i.e. $p(\theta \mid \alpha)$.
 The prior distribution might not be easily determined. In this case, one possibility is to use the Jeffreys prior as the prior distribution before updating it with newer observations.
 The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p(\mathbf{X} \mid \theta)$. This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written $\operatorname{L}(\theta \mid \mathbf{X}) = p(\mathbf{X} \mid \theta)$.
 The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $p(\mathbf{X} \mid \alpha) = \int p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha) \, d\theta$.
 The posterior distribution is the distribution of the parameter(s) after taking into account the observed data. This is determined by Bayes' rule, which forms the heart of Bayesian inference:
$$p(\theta \mid \mathbf{X}, \alpha) = \frac{p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha)}{p(\mathbf{X} \mid \alpha)} \propto p(\mathbf{X} \mid \theta) \, p(\theta \mid \alpha)$$
Note that this is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
Bayesian prediction
 The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior:
$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta$$
 The prior predictive distribution is the distribution of a new data point, marginalized over the prior:
$$p(\tilde{x} \mid \alpha) = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \alpha) \, d\theta$$
Bayesian theory calls for the use of the posterior predictive distribution to do predictive inference, i.e., to predict the distribution of a new, unobserved data point. That is, instead of a fixed point as a prediction, a distribution over possible points is returned. Only this way is the entire posterior distribution of the parameter(s) used. By comparison, prediction in frequentist statistics often involves finding an optimum point estimate of the parameter(s), e.g., by maximum likelihood or maximum a posteriori estimation (MAP), and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance of the predictive distribution.
(In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics when constructed from a normal distribution with unknown mean and variance are constructed using a Student's t-distribution. This correctly estimates the variance, due to the fact that (1) the average of normally distributed random variables is also normally distributed; (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly, or at least to an arbitrary level of precision when numerical methods are used.)
Note that both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, and hence the prior and posterior distributions come from the same family, it can easily be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.
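The underestimation of predictive variance by a plug-in estimate can be illustrated with a Bernoulli model and a conjugate Beta prior. This is a sketch under stated assumptions: the data (7 successes, 3 failures), the uniform Beta(1, 1) prior and the function names are hypothetical, and the plug-in value used here is the posterior mean rather than any particular frequentist estimate. The posterior predictive count over `m` future trials then follows a beta-binomial distribution, whose standard variance formula is used below.

```python
def beta_binomial_variance(m, a, b):
    """Variance of the number of successes in m future trials when the
    success probability has a Beta(a, b) posterior (posterior predictive)."""
    return m * a * b * (a + b + m) / ((a + b) ** 2 * (a + b + 1))

def plug_in_variance(m, p):
    """Variance of a Binomial(m, p) count with p treated as exactly known."""
    return m * p * (1 - p)

# Hypothetical data: 7 successes and 3 failures, with a uniform Beta(1, 1) prior.
a, b = 1 + 7, 1 + 3          # conjugate update of the hyperparameters
m = 20                       # number of future trials to predict
p_hat = a / (a + b)          # point estimate (posterior mean) used for plug-in

predictive_var = beta_binomial_variance(m, a, b)
plug_in_var = plug_in_variance(m, p_hat)
print(predictive_var, plug_in_var)
```

The two variances differ by the factor (a + b + m) / (a + b + 1), so they agree for a single future trial and the plug-in value falls increasingly short as more trials are predicted.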
Inference over exclusive and exhaustive possibilities
If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.
General formulation
Suppose a process is generating independent and identically distributed events $E_n,\ n = 1, 2, 3, \ldots$, but the probability distribution is unknown. Let the event space $\Omega$ represent the current state of belief for this process. Each model is represented by event $M_m$. The conditional probabilities $P(E_n \mid M_m)$ are specified to define the models. $P(M_m)$ is the degree of belief in $M_m$. Before the first inference step, $\{P(M_m)\}$ is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.
Suppose that the process is observed to generate $E \in \{E_n\}$. For each $M \in \{M_m\}$, the prior $P(M)$ is updated to the posterior $P(M \mid E)$. From Bayes' theorem:^{[4]}
$$P(M \mid E) = \frac{P(E \mid M)}{\sum_m P(E \mid M_m) \, P(M_m)} \cdot P(M)$$
Upon observation of further evidence, this procedure may be repeated.
Multiple observations
For a sequence of independent and identically distributed observations $\mathbf{E} = (e_1, \ldots, e_n)$, it can be shown by induction that repeated application of the above is equivalent to
$$P(M \mid \mathbf{E}) = \frac{P(\mathbf{E} \mid M)}{\sum_m P(\mathbf{E} \mid M_m) \, P(M_m)} \cdot P(M)$$
where
$$P(\mathbf{E} \mid M) = \prod_k P(e_k \mid M).$$
Parametric formulation
By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is however equally applicable to discrete distributions.
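Let the vector $\boldsymbol{\theta}$ span the parameter space. Let the initial prior distribution over $\boldsymbol{\theta}$ be $p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$, where $\boldsymbol{\alpha}$ is a set of parameters to the prior itself, or hyperparameters. Let $\mathbf{E} = (e_1, \ldots, e_n)$ be a sequence of independent and identically distributed event observations, where all $e_i$ are distributed as $p(e \mid \boldsymbol{\theta})$ for some $\boldsymbol{\theta}$. Bayes' theorem is applied to find the posterior distribution over $\boldsymbol{\theta}$:
$$p(\boldsymbol{\theta} \mid \mathbf{E}, \boldsymbol{\alpha}) = \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{p(\mathbf{E} \mid \boldsymbol{\alpha})} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha})}{\int p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) \, p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) \, d\boldsymbol{\theta}} \cdot p(\boldsymbol{\theta} \mid \boldsymbol{\alpha})$$
where
$$p(\mathbf{E} \mid \boldsymbol{\theta}, \boldsymbol{\alpha}) = \prod_k p(e_k \mid \boldsymbol{\theta}).$$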
Mathematical properties
Interpretation of factor
$\frac{P(E \mid M)}{P(E)} > 1 \Rightarrow P(E \mid M) > P(E)$. That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, $\frac{P(E \mid M)}{P(E)} = 1 \Rightarrow P(E \mid M) = P(E)$. That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.
Cromwell's rule
If $P(M) = 0$ then $P(M \mid E) = 0$. If $P(M) = 1$, then $P(M \mid E) = 1$. This can be interpreted to mean that hard convictions are insensitive to counterevidence.
The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not $M$" in place of "$M$", yielding "if $1 - P(M) = 0$, then $1 - P(M \mid E) = 0$", from which the result immediately follows.
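A short numerical check makes the rule concrete. The likelihood values below are hypothetical and `posterior` is an illustrative helper for a model against its complement; the sketch assumes the evidence has non-zero probability.

```python
def posterior(prior_m, like_m, like_not_m):
    """P(M | E) for a model M against its complement "not M"."""
    evidence = prior_m * like_m + (1 - prior_m) * like_not_m  # P(E)
    return prior_m * like_m / evidence

# Hypothetical likelihoods: the evidence is 99 times more probable under M.
like_m, like_not_m = 0.99, 0.01

for prior_m in (0.0, 0.001, 0.5, 1.0):
    print(prior_m, posterior(prior_m, like_m, like_not_m))
```

A prior of exactly 0 or 1 is returned unchanged however strongly the evidence points the other way, while even a very small non-zero prior moves.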
Asymptotic behaviour of posterior
Consider the behaviour of a belief distribution as it is updated a large number of times with independent and identically distributed trials. For sufficiently nice prior probabilities, the Bernstein–von Mises theorem gives that in the limit of infinite trials, the posterior converges to a Gaussian distribution independent of the initial prior, under some conditions first outlined and rigorously proven by Joseph L. Doob in 1948, namely if the random variable in consideration has a finite probability space. More general results were obtained later by the statistician David A. Freedman, who established in two seminal research papers^{[citation needed]} in 1963 and 1965 when and under what circumstances the asymptotic behaviour of the posterior is guaranteed. His 1963 paper treats, like Doob (1949), the finite case and comes to a satisfactory conclusion. However, if the random variable has an infinite but countable probability space (i.e., corresponding to a die with infinitely many faces), the 1965 paper demonstrates that for a dense subset of priors the Bernstein–von Mises theorem is not applicable. In this case there is almost surely no asymptotic convergence. Later in the 1980s and 1990s Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.^{[5]} To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.
Conjugate priors
In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.
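As one well-known instance, a Beta prior is conjugate to Bernoulli observations: the posterior is again a Beta distribution whose hyperparameters are the prior ones plus the counts of successes and failures. The sketch below assumes a hypothetical Beta(2, 2) prior and a made-up data set; `beta_update` is an illustrative name.

```python
def beta_update(a, b, observations):
    """Update Beta(a, b) hyperparameters with Bernoulli observations (0 or 1)."""
    successes = sum(observations)
    failures = len(observations) - successes
    return a + successes, b + failures

a0, b0 = 2, 2                          # hypothetical prior hyperparameters
data = [1, 0, 1, 1, 0, 1, 1, 1]        # hypothetical observations
a1, b1 = beta_update(a0, b0, data)

print(a1, b1)            # hyperparameters of the Beta posterior
print(a1 / (a1 + b1))    # posterior mean
```

The whole update reduces to two additions, which is what "expressed in closed form" buys in practice.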
Estimates of parameters and predictions
It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.
For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.^{[6]}
If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.^{[7]}^{[citation needed]}
$$\tilde{\theta} = \operatorname{E}[\theta] = \int \theta \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta$$
Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:^{[8]}^{[citation needed]}
$$\{\theta_{\text{MAP}}\} \subset \arg\max_{\theta} p(\theta \mid \mathbf{X}, \alpha)$$
There are examples where no maximum is attained, in which case the set of MAP estimates is empty.
There are other methods of estimation that minimize the posterior risk (expected-posterior loss) with respect to a loss function, and these are of interest to statistical decision theory using the sampling distribution ("frequentist statistics").^{[9]}^{[citation needed]}
The posterior predictive distribution of a new observation $\tilde{x}$ (that is independent of previous observations) is determined by^{[10]}^{[citation needed]}
$$p(\tilde{x} \mid \mathbf{X}, \alpha) = \int p(\tilde{x}, \theta \mid \mathbf{X}, \alpha) \, d\theta = \int p(\tilde{x} \mid \theta) \, p(\theta \mid \mathbf{X}, \alpha) \, d\theta$$
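The three point estimates discussed in this section (posterior mean, posterior median and MAP) can be compared on a simple one-dimensional posterior. The sketch below assumes a hypothetical posterior density proportional to theta^7 (1 - theta)^3 on [0, 1] and approximates it on a grid of cell midpoints; the grid size and variable names are arbitrary choices.

```python
# Hypothetical posterior: density proportional to theta^7 * (1 - theta)^3 on [0, 1].
n = 10_000
grid = [(i + 0.5) / n for i in range(n)]            # midpoints of n equal cells
weights = [t ** 7 * (1 - t) ** 3 for t in grid]
total = sum(weights)
probs = [w / total for w in weights]                # normalized cell probabilities

# Posterior mean: probability-weighted average of the grid points.
posterior_mean = sum(t * p for t, p in zip(grid, probs))

# MAP estimate: grid point with the largest posterior probability.
map_estimate = grid[max(range(n), key=lambda i: probs[i])]

# Posterior median: first grid point at which the cumulative probability reaches 1/2.
cumulative = 0.0
for t, p in zip(grid, probs):
    cumulative += p
    if cumulative >= 0.5:
        posterior_median = t
        break

print(posterior_mean, posterior_median, map_estimate)
```

For this skewed posterior the three estimates differ, which is why the choice of loss function matters when a single number has to be reported.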
Examples
Probability of a hypothesis
Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let $H_1$ correspond to bowl #1, and $H_2$ to bowl #2. It is given that the bowls are identical from Fred's point of view, thus $P(H_1) = P(H_2)$, and the two must add up to 1, so both are equal to 0.5. The event $E$ is the observation of a plain cookie. From the contents of the bowls, we know that $P(E \mid H_1) = 30/40 = 0.75$ and $P(E \mid H_2) = 20/40 = 0.5$. Bayes' formula then yields
$$P(H_1 \mid E) = \frac{P(E \mid H_1) \, P(H_1)}{P(E \mid H_1) \, P(H_1) + P(E \mid H_2) \, P(H_2)} = \frac{0.75 \times 0.5}{0.75 \times 0.5 + 0.5 \times 0.5} = 0.6$$
Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, $P(H_1)$, which was 0.5. After observing the cookie, we must revise the probability to $P(H_1 \mid E)$, which is 0.6.
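The same calculation can be written as a short program. The bowl contents are taken from the example; the dictionary layout and variable names are illustrative choices, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Bowl contents from the example: (chocolate chip, plain).
bowls = {"bowl1": (10, 30), "bowl2": (20, 20)}

# Fred picks a bowl at random, so each bowl has prior probability 1/2.
prior = {name: Fraction(1, 2) for name in bowls}

# Likelihood of drawing a plain cookie from each bowl.
likelihood = {name: Fraction(plain, choc + plain)
              for name, (choc, plain) in bowls.items()}

evidence = sum(prior[name] * likelihood[name] for name in bowls)  # P(E)
posterior = {name: prior[name] * likelihood[name] / evidence for name in bowls}

print(posterior["bowl1"])  # 3/5
```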
Making a prediction
An archaeologist is working at a site thought to be from the medieval period, from the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?
The degree of belief in the continuous variable $C$ (century) is to be calculated, with the discrete set of events $\{GD, G\bar{D}, \bar{G}D, \bar{G}\bar{D}\}$ as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,
$$P(E = GD \mid C = c) = \left(0.01 + \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 - \frac{0.5 - 0.05}{16 - 11}(c - 11)\right)$$
$$P(E = G\bar{D} \mid C = c) = \left(0.01 + \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 + \frac{0.5 - 0.05}{16 - 11}(c - 11)\right)$$
$$P(E = \bar{G}D \mid C = c) = \left((1 - 0.01) - \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 - \frac{0.5 - 0.05}{16 - 11}(c - 11)\right)$$
$$P(E = \bar{G}\bar{D} \mid C = c) = \left((1 - 0.01) - \frac{0.81 - 0.01}{16 - 11}(c - 11)\right)\left(0.5 + \frac{0.5 - 0.05}{16 - 11}(c - 11)\right)$$
Assume a uniform prior of $f_C(c) = 0.2$, and that trials are independent and identically distributed. When a new fragment of type $e$ is discovered, Bayes' theorem is applied to update the degree of belief for each $c$:
$$f_C(c \mid E = e) = \frac{P(E = e \mid C = c)}{P(E = e)} f_C(c) = \frac{P(E = e \mid C = c)}{\int_{11}^{16} P(E = e \mid C = c) \, f_C(c) \, dc} f_C(c)$$
A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or $c = 15.2$. By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. Note that the Bernstein–von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events is finite (see above section on asymptotic behaviour of the posterior).
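A simulation of this kind can be sketched as follows. This is not the code behind the graph: the grid resolution, the random seed and all function names are assumptions, and because the random draws differ, the resulting probabilities should not be expected to match the figures quoted above. The density over the century is tabulated on a grid and updated one fragment at a time.

```python
import random

def p_glazed(c):
    """Probability that a fragment is glazed if the site dates from century c."""
    return 0.01 + (0.81 - 0.01) / (16 - 11) * (c - 11)

def p_decorated(c):
    """Probability that a fragment is decorated if the site dates from century c."""
    return 0.5 - (0.5 - 0.05) / (16 - 11) * (c - 11)

def likelihood(fragment, c):
    """P(E = fragment | C = c) for fragment = (glazed, decorated)."""
    glazed, decorated = fragment
    pg = p_glazed(c) if glazed else 1 - p_glazed(c)
    pd = p_decorated(c) if decorated else 1 - p_decorated(c)
    return pg * pd

def update(grid, density, fragment):
    """One Bayes update of a density tabulated on an evenly spaced grid."""
    step = grid[1] - grid[0]
    unnormalized = [likelihood(fragment, c) * f for c, f in zip(grid, density)]
    evidence = sum(unnormalized) * step          # approximates P(E = e)
    return [u / evidence for u in unnormalized]

n = 500
step = (16 - 11) / n
grid = [11 + (i + 0.5) * step for i in range(n)]   # cell midpoints
density = [0.2] * n                                # uniform prior f_C(c) = 0.2

rng = random.Random(0)                             # fixed seed (arbitrary choice)
true_c = 15.2
for _ in range(50):
    # Draw a fragment type from the model evaluated at the true century.
    fragment = (rng.random() < p_glazed(true_c), rng.random() < p_decorated(true_c))
    density = update(grid, density, fragment)

posterior_mean = sum(c * f for c, f in zip(grid, density)) * step
print(round(posterior_mean, 2))
```

Summing `density` over the grid cells that fall within a given century, multiplied by `step`, gives the probability assigned to that century.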
In frequentist statistics and decision theory
A decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.^{[11]}
Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of frequentist inference as parameter estimation, hypothesis testing, and computing confidence intervals.^{[12]} For example:
 "Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."^{[11]}
 "In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."^{[13]}
 "In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."^{[14]}
 "A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"^{[15]}
 "An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."^{[16]}
Model selection
Applications
Computer applications
Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s. There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like Gibbs sampling and other Metropolis–Hastings algorithm schemes.^{[17]} Recently Bayesian inference has gained popularity amongst the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.
As applied to statistical classification, Bayesian inference has been used in recent years to develop algorithms for identifying email spam. Applications which make use of Bayesian inference for spam filtering include CRM114, DSPAM, Bogofilter, SpamAssassin, SpamBayes, Mozilla, XEAMS, and others. Spam classification is treated in more detail in the article on the naive Bayes classifier.
Solomonoff's Inductive inference is the theory of prediction based on observations; for example, predicting the next symbol based upon a given series of symbols. The only assumption is that the environment follows some unknown but computable probability distribution. It is a formal inductive framework that combines two well-studied principles of inductive inference: Bayesian statistics and Occam’s Razor.^{[18]} Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.^{[19]}^{[20]}
In the courtroom
Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for 'beyond a reasonable doubt'.^{[21]}^{[22]}^{[23]} Bayes' theorem is applied successively to all evidence presented, with the posterior from one stage becoming the prior for the next. The benefit of a Bayesian approach is that it gives the juror an unbiased, rational mechanism for combining evidence. It may be appropriate to explain Bayes' theorem to jurors in odds form, as betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach, replacing multiplication with addition, might be easier for a jury to handle.
If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.^{[24]} For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
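The odds form and its logarithmic variant can be sketched as follows. The prior of 1/1000 comes from the example above; the likelihood ratios are purely hypothetical, the function names are illustrative, and the sketch assumes that the pieces of evidence are conditionally independent given guilt or innocence.

```python
import math

def posterior_probability(prior_prob, likelihood_ratios):
    """Combine a prior with independent pieces of evidence in odds form.

    Each likelihood ratio is P(evidence | guilty) / P(evidence | innocent).
    """
    odds = prior_prob / (1 - prior_prob)
    for ratio in likelihood_ratios:
        odds *= ratio                      # posterior odds = prior odds x ratio
    return odds / (1 + odds)

def posterior_log_odds(prior_prob, likelihood_ratios):
    """Same calculation with base-10 logarithms, so that evidence adds up."""
    log_odds = math.log10(prior_prob / (1 - prior_prob))
    return log_odds + sum(math.log10(r) for r in likelihood_ratios)

prior = 1 / 1000                 # uniform prior over 1,000 possible culprits
ratios = [100, 20, 0.5]          # hypothetical likelihood ratios
p = posterior_probability(prior, ratios)
log_odds = posterior_log_odds(prior, ratios)
print(p, log_odds)
```

In the logarithmic version each piece of evidence contributes a positive or negative number that is simply added to the running total.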
The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence expert witness explained Bayes' theorem to the jury in R v Adams. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."
Gardner-Medwin^{[25]} argues that the criterion on which a verdict in a criminal trial should be based is not the probability of guilt, but rather the probability of the evidence, given that the defendant is innocent (akin to a frequentist p-value). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:
 A The known facts and testimony could have arisen if the defendant is guilty
 B The known facts and testimony could have arisen if the defendant is innocent
 C The defendant is guilty.
Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.
Bayesian epistemology
Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.
Karl Popper and David Miller have rejected the alleged rationality of Bayesianism, i.e. using Bayes' rule to make epistemological inferences:^{[26]} it is prone to the same vicious circle as any other justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.
Other
 The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments.^{[27]}
 Bayesian search theory is used to search for lost objects.
 Bayesian inference in phylogeny
 Bayesian tool for methylation analysis
 Bayesian approaches to brain function investigate the brain as a Bayesian mechanism.
 Bayesian inference in ecological studies^{[28]}^{[29]}
Bayes and Bayesian inference
The problem considered by Bayes in Proposition 9 of his essay, "An Essay towards solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter a (the success rate) of the binomial distribution.^{[citation needed]}
History
The term Bayesian refers to Thomas Bayes (1702–1761), who proved a special case of what is now called Bayes' theorem. However, it was PierreSimon Laplace (1749–1827) who introduced a general version of the theorem and used it to approach problems in celestial mechanics, medical statistics, reliability, and jurisprudence.^{[30]} Early Bayesian inference, which used uniform priors following Laplace's principle of insufficient reason, was called "inverse probability" (because it infers backwards from observations to parameters, or from effects to causes^{[31]}). After the 1920s, "inverse probability" was largely supplanted by a collection of methods that came to be called frequentist statistics.^{[31]}
In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "noninformative" current, the statistical analysis depends on only the model assumed, the data analyzed,^{[32]} and the method assigning the prior, which differs from one objective Bayesian to another objective Bayesian. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.
In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and to an increasing interest in non-standard, complex applications.^{[33]} Despite the growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.^{[34]} Nonetheless, Bayesian methods are widely accepted and used, for example in machine learning.^{[35]}
See also
 Bayes' theorem
 Bayesian Analysis, the journal of the ISBA
 Bayesian hierarchical modeling
 Bayesian probability
 Bayesian structural time series (BSTS)
 Inductive probability
 International Society for Bayesian Analysis (ISBA)
 Jeffreys prior
 Monty Hall problem
Notes
 ^ Hacking (1967, Section 3, p. 316), Hacking (1988, p. 124)
 ^ "Bayes' Theorem (Stanford Encyclopedia of Philosophy)". Plato.stanford.edu. Retrieved 2014-01-05.
 ^ van Fraassen, B. (1989) Laws and Symmetry, Oxford University Press. ISBN 0198248601
 ^ Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. ISBN 9781439840955.
 ^ Larry Wasserman et alia, JASA 2000.
 ^ Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.
 ^ Choudhuri, Nidhan; Ghosal, Subhashis; Roy, Anindya (2005-01-01). "Bayesian Methods for Function Estimation". Handbook of Statistics. Bayesian Thinking. 25: 373–414. doi:10.1016/s0169-7161(05)25013-7.
 ^ "Maximum A Posteriori (MAP) Estimation". www.probabilitycourse.com. Retrieved 2017-06-02.
 ^ Yu, Angela. "Introduction to Bayesian Decision Theory" (PDF). http://www.cogsci.ucsd.edu/.
 ^ "Posterior Predictive Distribution Stat Slide" (PDF). stat.sc.edu.
 ^ ^{a} ^{b} Bickel & Doksum (2001, p. 32)
 ^ Kiefer, J.; Schwartz, R. (1965). "Admissible Bayes Character of T^{2}, R^{2}, and Other Fully Invariant Tests for Multivariate Normal Problems". Annals of Mathematical Statistics. 36: 747–770. doi:10.1214/aoms/1177700051.
 Schwartz, R. (1969). "Invariant Proper Bayes Tests for Exponential Families". Annals of Mathematical Statistics. 40: 270–283. doi:10.1214/aoms/1177697822.
 Hwang, J. T. & Casella, George (1982). "Minimax Confidence Sets for the Mean of a Multivariate Normal Distribution". Annals of Statistics. 10: 868–881. doi:10.1214/aos/1176345877.
 ^ Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.). (See p. 309 of Chapter 6.7 "Admissibility", and pp. 17–18 of Chapter 1.8 "Complete Classes".)
 ^ Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN 0387963073. (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)
 ^ Cox, D. R.; Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall. ISBN 0041215370. p. 432
 ^ Cox, D. R.; Hinkley, D. V. (1974). Theoretical Statistics. Chapman and Hall. ISBN 0041215370. p. 433
 ^ Jim Albert (2009). Bayesian Computation with R, Second edition. New York, Dordrecht, etc.: Springer. ISBN 9780387922973.
 ^ Samuel Rathmanner and Marcus Hutter. "A Philosophical Treatise of Universal Induction". Entropy, 13(6):1076–1136, 2011.
 ^ "The Problem of Old Evidence", in §5 of "On Universal Prediction and Bayesian Confirmation", M. Hutter, Theoretical Computer Science, 2007, Elsevier
 ^ "Raymond J. Solomonoff", Peter Gacs, Paul M. B. Vitanyi, 2011 cs.bu.edu
 ^ Dawid, A. P. and Mortera, J. (1996) "Coherent Analysis of Forensic Identification Evidence". Journal of the Royal Statistical Society, Series B, 58, 425–443.
 ^ Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". Journal of the Royal Statistical Society, Series A, 160, 429–469.
 ^ Robertson, B. and Vignaux, G. A. (1995) Interpreting Evidence: Evaluating Forensic Science in the Courtroom. John Wiley and Sons. Chichester. ISBN 9780471960263
 ^ Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries
 ^ Gardner-Medwin, A. (2005) "What Probability Should the Jury Address?". Significance, 2 (1), March 2005
 ^ David Miller: Critical Rationalism
 ^ Howson & Urbach (2005), Jaynes (2003)
 ^ Ogle, Kiona; Tucker, Colin; Cable, Jessica M. (2014-01-01). "Beyond simple linear mixing models: process-based isotope partitioning of ecological processes". Ecological Applications. 24 (1): 181–195. doi:10.1890/1051-0761-24.1.181. ISSN 1939-5582.
 ^ Evaristo, Jaivime; McDonnell, Jeffrey J.; Scholl, Martha A.; Bruijnzeel, L. Adrian; Chun, Kwok P. (2016-01-01). "Insights into plant water uptake from xylem-water isotope measurements in two tropical catchments with contrasting moisture conditions". Hydrological Processes: n/a–n/a. doi:10.1002/hyp.10841. ISSN 1099-1085.
 ^ Stigler, Stephen M. (1986). "Chapter 3". The History of Statistics. Harvard University Press.
 ^ ^{a} ^{b} Fienberg, Stephen E. (2006). "When did Bayesian Inference Become 'Bayesian'?" (PDF). Bayesian Analysis. 1 (1): 1–40 [p. 5]. doi:10.1214/06-BA101. Archived from the original (PDF) on 2014-09-10.
 ^ Bernardo, José-Miguel (2005). "Reference analysis". Handbook of statistics. 25. pp. 17–90.
 ^ Wolpert, R. L. (2004). "A Conversation with James O. Berger". Statistical Science. 19 (1): 205–218. doi:10.1214/088342304000000053. MR 2082155.
 ^ Bernardo, José M. (2006). "A Bayesian mathematical statistics primer" (PDF). ICOTS7.
 ^ Bishop, C. M. (2007). Pattern Recognition and Machine Learning. New York: Springer. ISBN 0387310738.
References
 Aster, Richard; Borchers, Brian, and Thurber, Clifford (2012). Parameter Estimation and Inverse Problems, Second Edition, Elsevier. ISBN 0123850487, ISBN 9780123850485
 Bickel, Peter J. & Doksum, Kjell A. (2001). Mathematical Statistics, Volume 1: Basic and Selected Topics (Second (updated printing 2007) ed.). Pearson Prentice–Hall. ISBN 013850363X.
 Box, G. E. P. and Tiao, G. C. (1973) Bayesian Inference in Statistical Analysis, Wiley, ISBN 0471574287
 Edwards, Ward (1968). "Conservatism in Human Information Processing". In Kleinmuntz, B. Formal Representation of Human Judgment. Wiley.
 Edwards, Ward (1982). "Conservatism in Human Information Processing (excerpted)". In Daniel Kahneman, Paul Slovic and Amos Tversky. Judgment under uncertainty: Heuristics and biases. Cambridge University Press.
 Jaynes E. T. (2003) Probability Theory: The Logic of Science, CUP. ISBN 9780521592710 (Link to Fragmentary Edition of March 1996).
 Howson, C. & Urbach, P. (2005). Scientific Reasoning: the Bayesian Approach (3rd ed.). Open Court Publishing Company. ISBN 9780812695786.
 Phillips, L. D.; Edwards, Ward (October 2008). "Chapter 6: Conservatism in a Simple Probability Inference Task (Journal of Experimental Psychology (1966) 72: 346–354)". In Jie W. Weiss; David J. Weiss. A Science of Decision Making: The Legacy of Ward Edwards. Oxford University Press. p. 536. ISBN 9780195322989.
Further reading
 For a full report on the history of Bayesian statistics and the debates with frequentist approaches, read Vallverdu, Jordi (2016). Bayesians Versus Frequentists: A Philosophical Debate on Statistical Reasoning. New York: Springer. ISBN 9783662486382.
Elementary
The following books are listed in ascending order of probabilistic sophistication:
 Stone, JV (2013), "Bayes’ Rule: A Tutorial Introduction to Bayesian Analysis", Sebtel Press, England.
 Dennis V. Lindley (2013). Understanding Uncertainty, Revised Edition (2nd ed.). John Wiley. ISBN 9781118650127.
 Colin Howson & Peter Urbach (2005). Scientific Reasoning: The Bayesian Approach (3rd ed.). Open Court Publishing Company. ISBN 9780812695786.
 Berry, Donald A. (1996). Statistics: A Bayesian Perspective. Duxbury. ISBN 0534234763.
 Morris H. DeGroot & Mark J. Schervish (2002). Probability and Statistics (third ed.). Addison-Wesley. ISBN 9780201524888.
 Bolstad, William M. (2007) Introduction to Bayesian Statistics: Second Edition, John Wiley ISBN 0471270202
 Winkler, Robert L (2003). Introduction to Bayesian Inference and Decision (2nd ed.). Probabilistic. ISBN 0964793849. Updated classic textbook. Bayesian theory clearly presented.
 Lee, Peter M. Bayesian Statistics: An Introduction. Fourth Edition (2012), John Wiley ISBN 9781118332573
 Carlin, Bradley P. & Louis, Thomas A. (2008). Bayesian Methods for Data Analysis, Third Edition. Boca Raton, FL: Chapman and Hall/CRC. ISBN 1584886978.
 Gelman, Andrew; Carlin, John B.; Stern, Hal S.; Dunson, David B.; Vehtari, Aki; Rubin, Donald B. (2013). Bayesian Data Analysis, Third Edition. Chapman and Hall/CRC. ISBN 9781439840955.
Intermediate or advanced
 Berger, James O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer Series in Statistics (Second ed.). Springer-Verlag. ISBN 0387960988.
 Bernardo, José M.; Smith, Adrian F. M. (1994). Bayesian Theory. Wiley.
 DeGroot, Morris H., Optimal Statistical Decisions. Wiley Classics Library. 2004. (Originally published (1970) by McGraw-Hill.) ISBN 047168029X.
 Schervish, Mark J. (1995). Theory of statistics. Springer-Verlag. ISBN 0387945466.
 Jaynes, E. T. (1998) Probability Theory: The Logic of Science.
 O'Hagan, A. and Forster, J. (2003) Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, New York. ISBN 0340529229.
 Robert, Christian P. (2001). The Bayesian Choice – A Decision-Theoretic Motivation (second ed.). Springer. ISBN 0387942963.
 Glenn Shafer and Pearl, Judea, eds. (1988) Probabilistic Reasoning in Intelligent Systems, San Mateo, CA: Morgan Kaufmann.
 Pierre Bessière et al. (2013), "Bayesian Programming", CRC Press. ISBN 9781439880326
 Francisco J. Samaniego (2010), "A Comparison of the Bayesian and Frequentist Approaches to Estimation" Springer, New York, ISBN 9781441959409
External links
 Hazewinkel, Michiel, ed. (2001) [1994], "Bayesian approach to statistical problems", Encyclopedia of Mathematics, Springer Science+Business Media B.V. / Kluwer Academic Publishers, ISBN 9781556080104
 Bayesian Statistics from Scholarpedia.
 Introduction to Bayesian probability from Queen Mary University of London
 Mathematical Notes on Bayesian Statistics and Markov Chain Monte Carlo
 Bayesian reading list, categorized and annotated by Tom Griffiths
 A. Hajek and S. Hartmann: Bayesian Epistemology, in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93–106.
 S. Hartmann and J. Sprenger: Bayesian Epistemology, in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609–620.
 Stanford Encyclopedia of Philosophy: "Inductive Logic"
 Bayesian Confirmation Theory
 What Is Bayesian Learning?