
Prior probability

From Wikipedia, the free encyclopedia

In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

Bayes' theorem calculates the renormalized pointwise product of the prior and the likelihood function, to produce the posterior probability distribution, which is the conditional distribution of the uncertain quantity given the data.
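This renormalized pointwise product can be sketched on a discrete grid. The polling numbers below are illustrative, not from the article:

```python
# Minimal sketch: Bayes' theorem as the renormalized pointwise product of
# prior and likelihood, evaluated on a discrete grid.

def posterior_on_grid(grid, prior, likelihood):
    """Pointwise product of prior and likelihood, renormalized to sum to 1."""
    unnorm = [p * likelihood(theta) for theta, p in zip(grid, prior)]
    z = sum(unnorm)  # the normalizing constant (the "evidence")
    return [u / z for u in unnorm]

# Hypothetical example: theta is the proportion of voters supporting a
# politician; we observe 7 supporters out of 10 polled (binomial kernel).
grid = [i / 100 for i in range(1, 100)]
prior = [1 / len(grid)] * len(grid)          # uniform prior over the grid
post = posterior_on_grid(grid, prior, lambda t: t**7 * (1 - t)**3)
```

With a flat prior the posterior peaks at the maximum-likelihood value, here 0.7.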

Similarly, the prior probability of a random event or an uncertain proposition is the unconditional probability that is assigned before any relevant evidence is taken into account.

Priors can be created using a number of methods.[1](pp27–41) A prior can be determined from past information, such as previous experiments. A prior can be elicited from the purely subjective assessment of an experienced expert. An uninformative prior can be created to reflect a balance among outcomes when no information is available. Priors can also be chosen according to some principle, such as symmetry or maximizing entropy given constraints; examples are the Jeffreys prior or Bernardo's reference prior. When a family of conjugate priors exists, choosing a prior from that family simplifies calculation of the posterior distribution.

Parameters of prior distributions are a kind of hyperparameter. For example, if one uses a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:

  • p is a parameter of the underlying system (Bernoulli distribution), and
  • α and β are parameters of the prior distribution (beta distribution); hence hyperparameters.

Hyperparameters themselves may have hyperprior distributions expressing beliefs about their values. A Bayesian model with more than one level of prior like this is called a hierarchical Bayes model.
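The beta–Bernoulli setup above can be sketched in a few lines (the starting hyperparameters and data are illustrative): by conjugacy, α and β are updated simply by counting successes and failures.

```python
# Sketch: Beta(alpha, beta) prior on the Bernoulli parameter p.
# The hyperparameters update by counting, thanks to conjugacy.

def update_beta(alpha, beta, observations):
    """Return posterior hyperparameters after 0/1 Bernoulli observations."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures

# Illustrative prior Beta(2, 2) and data with 8 successes, 2 failures.
a, b = update_beta(2, 2, [1] * 8 + [0] * 2)   # -> Beta(10, 4)
posterior_mean = a / (a + b)                  # mean of a Beta distribution
```

This counting rule is exactly why conjugate families simplify posterior calculation.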

YouTube Encyclopedic

  • ✪ Bayes' Theorem for Everyone 04 - Prior Probability
  • ✪ Introduction to Bayesian statistics, part 1: The basic concepts
  • ✪ What is Prior Probability
  • ✪ (SSP 1.1.2) Implied Bayes Theorm - Likelihood, Priori, Posteriori
  • ✪ 17. Bayesian Statistics


If we're going to work with Bayes' theorem, or even talk intelligently about it, we need to understand something called the "prior probability", or just the "prior" for short. When you use Bayes' theorem, you are adjusting your level of belief — updating the likelihood you assign to some hypothesis — based on new evidence coming in. What gets adjusted is the prior probability: the initial probability you would assign to something before you had that evidence. Here's how it works. When you say you assign a certain likelihood to something, what you're really saying is that, of all the possible universes that could exist, you think this many of them contain that thing as a truth. If positive, confirming evidence comes in for my hypothesis, then if I was already pretty sure of myself, my confidence gets even higher; if I wasn't very sure, it still gets bigger — it just doesn't finish up as big.

You might ask how these priors get set. We set them based on our experiences up to the point where the new evidence arrives. That's why two people can look at exactly the same evidence and still disagree about something. Let's do another thought experiment. Say I'm a believer in astrology — such a believer that I organize my whole life around it — and say you're a skeptic: you believe astrology is a superstition from the Middle Ages and not worth anything. What if I try to convince you that astrology is worthwhile by bringing you some remarkable evidence? Say I come to you with this morning's newspaper, and my horoscope says my financial troubles are over; then I tell you I got a phone call out of the blue offering me a fantastic job. Say you trust me completely, so both of us are looking at the same evidence, and we both agree it is truly remarkable: this evidence would be very unusual in a universe where astrology was not a working mechanism.

Now look at what this does to the probabilities. I believe big time in astrology, so the relative number of universes in which I think astrology is a real working force is large. Then we get this truly unexpected evidence — my unusual offer of a lucrative job — and when I do my Bayesian updating on it, I get only a minor increase in my certainty that astrology is real, because my level of belief was large to begin with; but the evidence does increase my confidence. Now look at your side. You're the skeptic: you had very little trust in astrology, so relatively few of the universes you consider possible contain the truth that astrology actually works; in most of them, people are deluding themselves or focusing on coincidences. But you and I agree the evidence is very surprising. When you do your Bayesian updating on it, your confidence in astrology gets a big boost — it might double or even triple — but it started out so small that you still believe astrology is a pile of hooey. You would have to see a whole lot more remarkable confirmations like this one before you would believe in astrology.

You might be thinking this whole thing is pretty subjective, since you can come up with whatever answer you want based on your prior beliefs. There are two important points to make about this. Number one: Bayes' theorem is as good as you can do. It shows us mathematically that this is as good as it gets with limited data. It would be great to have a lot of data. If I were a manufacturer of a drug that was supposed to cure migraines, I could collect a lot of data — hire five thousand volunteers to test the drug, carefully record how many were cured and how many got no results — and I could use a different type of statistical analysis, frequentist statistics, to analyze how well the drug works, and be pretty confident in my answer. Unfortunately, life doesn't work that way for the day-to-day things we deal with: we don't have that much evidence, and so Bayes is as good as you can do.

But it's not all that bleak, because if you have even a fair amount of evidence, something else takes place, and this is point number two: something called washing out the priors. The priors become less and less important as you collect more pieces of evidence. Say I have a certain level of confidence in some hypothesis, and positive evidence comes in: I gain more confidence in the hypothesis. More positive evidence comes in, and I'm even more confident; yet more comes in, and I finish with a pretty high level of confidence. Now take a person with an inappropriately small amount of confidence in the hypothesis who sees the same positive evidence. The first piece increases his confidence, then the next, then the next, and eventually — looking at the same evidence I'm looking at — we begin to come to similar conclusions and trust that hypothesis. So two people should come to the correct answer if they have enough evidence and look at it honestly. That's why people with experience seem to know what they're doing and get really good results. If I were going to lay a brick patio, knowing nothing about brick patios, the first thing I'd do is talk to an expert — someone who's done a lot of brick patios — and I'd expect all the advice he gives me to be pretty much on the money. The reason is that he's built a lot of these and had a lot of successes over a long period of time, which is enough to wash out his priors: whatever nutty ideas he started with, he would have figured out by now, so the ideas he has left should be pretty good ones, and those are the ones I want. That's called washing out the priors.

There's a way to break this whole system, and that is to be absolutely certain about something. If you're one hundred percent sure about something, the process of washing out the priors falls apart. We're all probably very sure about certain things: that if you hold a brick out in front of you and let go, gravity will pull it to the ground; that your mother loves you; that you're sitting at a computer right now. Interestingly, science is never one hundred percent sure of anything, and for good reason: the history of science is littered with theories we were very confident about that turned out to fail and were replaced by better theories. But that's a different subject. Here's what you're really saying when you claim to be one hundred percent confident of something: that the proposition exists in every universe you consider possible. That might work fine for ideas like gravity, where we're unlikely to get contrary evidence, but for other things it may mean we have to deflect evidence, or modify auxiliary hypotheses, in order to keep our cherished hypothesis alive.

Say I have a crazy idea: I believe that Episcopalians are terrible drivers — that "Episcopalians are bad drivers" is a truth in every possible universe I consider. Now say you're a more rational person and you try to talk some sense into me. You might point out that Sally down the street goes to the Episcopalian church and she's a good driver. I can counter that by saying I've talked to Sally, she doesn't agree with the religious beliefs of the people who go to that church, so she's not really an Episcopalian. You might say, okay, then look at these insurance statistics: they don't show that Episcopalians are any worse drivers than anybody else. I might say, of course they don't — that would be politically incorrect, and the insurance company is going to cover it up. At this point you might get pretty frustrated with me. You might say: what if Jehovah himself appeared right in front of you and said Episcopalians are fine people, they're good drivers, stop criticizing them? I could even counter that by saying I don't believe that was really Jehovah; I believe it was an evil spirit come to deceive me. Pretty exasperated by now, you might ask: is there anything that could convince you otherwise? That's a very good question to ask when you reach that point, and if I say there is no evidence that would convince me otherwise, this conversation isn't going anywhere. It means the hypothesis occupies every universe I consider possible, and all the evidence has to fit inside that hypothesis. No matter how many times I do my Bayesian updating, my set of universes never contains anything other than universes where Episcopalians are terrible drivers. A person who's totally certain will never be convinced otherwise: if you are one hundred percent sure of something, your priors will never wash out.

The same is true if you're one hundred percent doubtful. If you're totally convinced that some hypothesis is false, that hypothesis doesn't exist in any universe you consider possible — your prior is zero — and no matter how much updating takes place, it will never become one of your beliefs. Bayes' theorem shows us where our differences lie: if someone is absolutely certain about something, it's not about the evidence anymore, and you may want to discuss with that person how reasonable they're being. In fact, there are psychoses characterized by an absolute belief in something in the face of contrary evidence. If I were one hundred percent sure I had a brain tumor despite three neurologists and four MRIs that said otherwise, I'd probably have lost my grip on reality. Being totally certain of something allows us to keep our belief, but we never gain new knowledge. Please tune in to the next couple of videos — there are a lot more interesting ideas that fall right out of Bayes' theorem.
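The "washing out the priors" effect from the transcript can be sketched numerically. The likelihoods here (0.8 for confirming evidence under the hypothesis, 0.3 otherwise) are illustrative assumptions: two observers who start far apart converge once enough confirming evidence accumulates.

```python
# Two observers update the same hypothesis H on repeated confirming evidence.
# Assumed likelihoods: P(E | H) = 0.8, P(E | not H) = 0.3.

def update(prior, p_e_h=0.8, p_e_not_h=0.3):
    """One round of Bayesian updating on a piece of confirming evidence."""
    num = p_e_h * prior
    return num / (num + p_e_not_h * (1 - prior))

believer, skeptic = 0.90, 0.01        # very different starting priors
for _ in range(20):                   # twenty confirming observations
    believer, skeptic = update(believer), update(skeptic)
# After enough evidence both posteriors sit near 1, close to each other.
```

Note that a skeptic with a prior of exactly 0 would stay at 0 forever, which is the "never washes out" case described above.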


Informative priors

An informative prior expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a normal distribution with expected value equal to today's noontime temperature, with variance equal to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that day of the year.

This example has a property in common with many priors, namely, that the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature); pre-existing evidence which has already been taken into account is part of the prior and, as more evidence accumulates, the posterior is determined largely by the evidence rather than any original assumption, provided that the original assumption admitted the possibility of what the evidence is suggesting. The terms "prior" and "posterior" are generally relative to a specific datum or observation.
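The "posterior becomes the prior" chain can be sketched with a conjugate normal–normal model (all numbers below are illustrative): each day's posterior for the noon temperature serves as the next day's prior.

```python
# Sequential updating: yesterday's posterior is today's prior.
# Normal prior on the mean, normal observation with known variance.

def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate normal update for one observation."""
    post_var = 1 / (1 / prior_var + 1 / obs_var)
    post_mean = post_var * (prior_mean / prior_var + obs / obs_var)
    return post_mean, post_var

mean, var = 68.0, 25.0                 # illustrative prior for noon temp (F)
for reading in [71.0, 69.5, 72.0]:     # readings on successive days
    mean, var = normal_update(mean, var, reading, obs_var=9.0)
# The variance shrinks with each observation; the mean tracks the data.
```

As the text says, once enough readings accumulate, the posterior is driven by the data rather than by the original 68-degree assumption.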

Weakly informative priors

A weakly informative prior expresses partial information about a variable. An example is, when setting the prior distribution for the temperature at noon tomorrow in St. Louis, to use a normal distribution with mean 50 degrees Fahrenheit and standard deviation 40 degrees, which very loosely constrains the temperature to the range (10 degrees, 90 degrees) with a small chance of being below -30 degrees or above 130 degrees. The purpose of a weakly informative prior is for regularization, that is, to keep inferences in a reasonable range.
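The quoted ranges can be checked directly: for a normal prior with mean 50 and standard deviation 40, the interval (10, 90) is one standard deviation either way (about 68% of the mass), and values below −30 or above 130 lie beyond two standard deviations (about 5% combined).

```python
import math

# Probability mass of the weakly informative N(50, 40^2) prior in the
# ranges described in the text, via the normal CDF.

def normal_cdf(x, mean, sd):
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

mean, sd = 50.0, 40.0
p_mid = normal_cdf(90, mean, sd) - normal_cdf(10, mean, sd)          # ~0.68
p_tails = normal_cdf(-30, mean, sd) + 1 - normal_cdf(130, mean, sd)  # ~0.05
```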

Uninformative priors

An uninformative prior or diffuse prior expresses vague or general information about a variable. The term "uninformative prior" is somewhat of a misnomer. Such a prior might also be called a not very informative prior, or an objective prior, i.e. one that's not subjectively elicited.

Uninformative priors can express "objective" information such as "the variable is positive" or "the variable is less than some limit". The simplest and oldest rule for determining a non-informative prior is the principle of indifference, which assigns equal probabilities to all possibilities. In parameter estimation problems, the use of an uninformative prior typically yields results which are not too different from conventional statistical analysis, as the likelihood function often yields more information than the uninformative prior.

Some attempts have been made at finding a priori probabilities, i.e. probability distributions in some sense logically required by the nature of one's state of uncertainty; these are a subject of philosophical controversy, with Bayesians being roughly divided into two schools: "objective Bayesians", who believe such priors exist in many useful situations, and "subjective Bayesians" who believe that in practice priors usually represent subjective judgements of opinion that cannot be rigorously justified (Williamson 2010). Perhaps the strongest arguments for objective Bayesianism were given by Edwin T. Jaynes, based mainly on the consequences of symmetries and on the principle of maximum entropy.

As an example of an a priori prior, due to Jaynes (2003), consider a situation in which one knows a ball has been hidden under one of three cups, A, B, or C, but no other information is available about its location. In this case a uniform prior of p(A) = p(B) = p(C) = 1/3 seems intuitively like the only reasonable choice. More formally, we can see that the problem remains the same if we swap around the labels ("A", "B" and "C") of the cups. It would therefore be odd to choose a prior for which a permutation of the labels would cause a change in our predictions about which cup the ball will be found under; the uniform prior is the only one which preserves this invariance. If one accepts this invariance principle then one can see that the uniform prior is the logically correct prior to represent this state of knowledge. This prior is "objective" in the sense of being the correct choice to represent a particular state of knowledge, but it is not objective in the sense of being an observer-independent feature of the world: in reality the ball exists under a particular cup, and it only makes sense to speak of probabilities in this situation if there is an observer with limited knowledge about the system.

As a more contentious example, Jaynes published an argument (Jaynes 1968) based on Lie groups that suggests that the prior representing complete uncertainty about a probability should be the Haldane prior p^(−1)(1 − p)^(−1). The example Jaynes gives is of finding a chemical in a lab and asking whether it will dissolve in water in repeated experiments. The Haldane prior[2] gives by far the most weight to p = 0 and p = 1, indicating that the sample will either dissolve every time or never dissolve, with equal probability. However, if one has observed samples of the chemical dissolve in one experiment and not dissolve in another experiment, then this prior is updated to the uniform distribution on the interval [0, 1]. This is obtained by applying Bayes' theorem to the data set consisting of one observation of dissolving and one of not dissolving, using the above prior. The Haldane prior is an improper prior distribution (meaning that it does not integrate to 1) that puts 100% of the probability content at either p = 0 or at p = 1 if a finite number of observations have all given the same result. Harold Jeffreys devised a systematic way of designing uninformative proper priors, e.g., the Jeffreys prior p^(−1/2)(1 − p)^(−1/2) for the Bernoulli random variable.
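The Haldane update just described follows from Beta conjugacy: treating the Haldane prior as the improper Beta(0, 0), one dissolving and one non-dissolving observation yield Beta(1, 1), which is the uniform distribution. A minimal sketch:

```python
# Improper Haldane prior as Beta(0, 0); conjugate update by counting.

def beta_posterior(alpha, beta, successes, failures):
    return alpha + successes, beta + failures

a, b = beta_posterior(0, 0, successes=1, failures=1)   # -> Beta(1, 1)

# Beta(1, 1) has density p^(a-1) * (1-p)^(b-1) = 1: uniform on (0, 1).
density = [p ** (a - 1) * (1 - p) ** (b - 1) for p in (0.1, 0.5, 0.9)]
```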

Priors can be constructed which are proportional to the Haar measure if the parameter space X carries a natural group structure which leaves invariant our Bayesian state of knowledge (Jaynes, 1968). This can be seen as a generalisation of the invariance principle used to justify the uniform prior over the three cups in the example above. For example, in physics we might expect that an experiment will give the same results regardless of our choice of the origin of a coordinate system. This induces the group structure of the translation group on X, which determines the prior probability as a constant improper prior. Similarly, some measurements are naturally invariant to the choice of an arbitrary scale (e.g., whether centimeters or inches are used, the physical results should be equal). In such a case, the scale group is the natural group structure, and the corresponding prior on X is proportional to 1/x. It sometimes matters whether we use the left-invariant or right-invariant Haar measure. For example, the left and right invariant Haar measures on the affine group are not equal. Berger (1985, p. 413) argues that the right-invariant Haar measure is the correct choice.

Another idea, championed by Edwin T. Jaynes, is to use the principle of maximum entropy (MAXENT). The motivation is that the Shannon entropy of a probability distribution measures the amount of information contained in the distribution. The larger the entropy, the less information is provided by the distribution. Thus, by maximizing the entropy over a suitable set of probability distributions on X, one finds the distribution that is least informative in the sense that it contains the least amount of information consistent with the constraints that define the set. For example, the maximum entropy prior on a discrete space, given only that the probability is normalized to 1, is the prior that assigns equal probability to each state. And in the continuous case, the maximum entropy prior given that the density is normalized with mean zero and variance unity is the standard normal distribution. The principle of minimum cross-entropy generalizes MAXENT to the case of "updating" an arbitrary prior distribution with suitable constraints in the maximum-entropy sense.
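The discrete claim is easy to verify numerically: among normalized distributions on a finite space, no randomly chosen distribution exceeds the uniform one in Shannon entropy. A small sketch:

```python
import math
import random

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

n = 5
uniform = [1 / n] * n        # the MAXENT solution under normalization alone

random.seed(0)
for _ in range(1000):        # random normalized distributions on 5 states
    w = [random.random() for _ in range(n)]
    total = sum(w)
    p = [x / total for x in w]
    assert entropy(p) <= entropy(uniform) + 1e-12
```

The uniform distribution attains entropy log n, the maximum possible on n states.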

A related idea, reference priors, was introduced by José-Miguel Bernardo. Here, the idea is to maximize the expected Kullback–Leibler divergence of the posterior distribution relative to the prior. This maximizes the expected posterior information about X when the prior density is p(x); thus, in some sense, p(x) is the "least informative" prior about X. The reference prior is defined in the asymptotic limit, i.e., one considers the limit of the priors so obtained as the number of data points goes to infinity. In the present case, the KL divergence between the prior and posterior distributions is given by

KL = ∫ p(t) ∫ p(x | t) log( p(x | t) / p(x) ) dx dt

Here, t is a sufficient statistic for some parameter x. The inner integral is the KL divergence between the posterior p(x | t) and prior p(x) distributions, and the result is the weighted mean over all values of t. Splitting the logarithm into two parts, reversing the order of integrals in the second part and noting that log p(x) does not depend on t yields

KL = ∫ p(t) ∫ p(x | t) log p(x | t) dx dt − ∫ log p(x) ∫ p(t) p(x | t) dt dx

The inner integral in the second part is the integral over t of the joint density p(x, t). This is the marginal distribution p(x), so we have

KL = ∫ p(t) ∫ p(x | t) log p(x | t) dx dt − ∫ p(x) log p(x) dx

Now we use the concept of entropy which, in the case of probability distributions, is the negative expected value of the logarithm of the probability mass or density function, H[x] = −∫ p(x) log p(x) dx or H[x | t] = −∫ p(x | t) log p(x | t) dx. Using this in the last equation yields

KL = −∫ p(t) H[x | t] dt + H[x]

In words, KL is the negative expected value over t of the entropy of x conditional on t, plus the marginal (i.e. unconditional) entropy of x. In the limiting case where the sample size tends to infinity, the Bernstein–von Mises theorem states that the distribution of x conditional on a given observed value of t is normal with a variance equal to the reciprocal of the Fisher information at the 'true' value of x. The entropy of a normal density function is equal to half the logarithm of 2πev, where v is the variance of the distribution. In this case therefore H = log √(2πe / (k I(x*))), where k is the arbitrarily large sample size (to which Fisher information is proportional) and x* is the 'true' value. Since this does not depend on t it can be taken out of the integral, and as this integral is over a probability space it equals one. Hence we can write the asymptotic form of KL as

KL ≈ log √( k I(x*) / (2πe) ) + H[x]

where k is proportional to the (asymptotically large) sample size. We do not know the value of x*. Indeed, the very idea goes against the philosophy of Bayesian inference, in which 'true' values of parameters are replaced by prior and posterior distributions. So we remove x* by replacing it with x and taking the expected value of the normal entropy, which we obtain by multiplying by p(x) and integrating over x. This allows us to combine the logarithms, yielding

KL ≈ −∫ p(x) log( p(x) / √( k I(x) / (2πe) ) ) dx

This is a quasi-KL divergence ("quasi" in the sense that the square root of the Fisher information may be the kernel of an improper distribution). Due to the minus sign, we need to minimise this in order to maximise the KL divergence with which we started. The minimum value of the last equation occurs where the two distributions in the logarithm argument, improper or not, do not diverge. This in turn occurs when the prior distribution is proportional to the square root of the Fisher information of the likelihood function. Hence in the single parameter case, reference priors and Jeffreys priors are identical, even though Jeffreys arrived at his prior via a very different rationale.
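For the Bernoulli case this identity is easy to check: the Fisher information is I(p) = 1/(p(1 − p)), and its square root reproduces the Jeffreys form p^(−1/2)(1 − p)^(−1/2). A quick numeric sketch:

```python
import math

def fisher_information_bernoulli(p):
    # E[(d/dp log f(x; p))^2]: the score is 1/p when x = 1, -1/(1-p) when x = 0.
    return p * (1 / p) ** 2 + (1 - p) * (1 / (1 - p)) ** 2  # = 1 / (p(1 - p))

for p in (0.1, 0.3, 0.5, 0.9):
    jeffreys = p ** -0.5 * (1 - p) ** -0.5
    assert abs(math.sqrt(fisher_information_bernoulli(p)) - jeffreys) < 1e-12
```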

Reference priors are often the objective prior of choice in multivariate problems, since other rules (e.g., Jeffreys' rule) may result in priors with problematic behavior.

Objective prior distributions may also be derived from other principles, such as information or coding theory (see e.g. minimum description length) or frequentist statistics (see frequentist matching). Such methods are used in Solomonoff's theory of inductive inference. Constructing objective priors has also been introduced recently in bioinformatics, especially for inference in cancer systems biology, where sample size is limited and a vast amount of prior knowledge is available. In these methods, an information-theory-based criterion, such as KL divergence or the log-likelihood function, is used for binary supervised learning problems[3] and mixture model problems.[4]

Philosophical problems associated with uninformative priors are associated with the choice of an appropriate metric, or measurement scale. Suppose we want a prior for the running speed of a runner who is unknown to us. We could specify, say, a normal distribution as the prior for his speed, but alternatively we could specify a normal prior for the time he takes to complete 100 metres, which is proportional to the reciprocal of the first prior. These are very different priors, but it is not clear which is to be preferred. Jaynes' often-overlooked method of transformation groups can answer this question in some situations.[5]

Similarly, if asked to estimate an unknown proportion between 0 and 1, we might say that all proportions are equally likely, and use a uniform prior. Alternatively, we might say that all orders of magnitude for the proportion are equally likely, the logarithmic prior, which is the uniform prior on the logarithm of proportion. The Jeffreys prior attempts to solve this problem by computing a prior which expresses the same belief no matter which metric is used. The Jeffreys prior for an unknown proportion p is p^(−1/2)(1 − p)^(−1/2), which differs from Jaynes' recommendation.
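The tension between the uniform and logarithmic priors is easy to see by simulation: under a uniform prior on the proportion, orders of magnitude are far from equally likely (the bracket boundaries below are illustrative).

```python
import random

# Under a uniform prior, p in (0.1, 1] is roughly 100x more likely than
# p in (0.001, 0.01], so orders of magnitude are not equally likely.
random.seed(1)
samples = [random.random() for _ in range(100_000)]
frac_small = sum(0.001 < p <= 0.01 for p in samples) / len(samples)
frac_large = sum(0.1 < p <= 1.0 for p in samples) / len(samples)
```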

Priors based on notions of algorithmic probability are used in inductive inference as a basis for induction in very general settings.

Practical problems associated with uninformative priors include the requirement that the posterior distribution be proper. The usual uninformative priors on continuous, unbounded variables are improper. This need not be a problem if the posterior distribution is proper. Another issue of importance is that if an uninformative prior is to be used routinely, i.e., with many different data sets, it should have good frequentist properties. Normally a Bayesian would not be concerned with such issues, but it can be important in this situation. For example, one would want any decision rule based on the posterior distribution to be admissible under the adopted loss function. Unfortunately, admissibility is often difficult to check, although some results are known (e.g., Berger and Strawderman 1996). The issue is particularly acute with hierarchical Bayes models; the usual priors (e.g., Jeffreys' prior) may give badly inadmissible decision rules if employed at the higher levels of the hierarchy.

Improper priors

Let events A1, A2, ..., An be mutually exclusive and exhaustive. If Bayes' theorem is written as

P(Ai | B) = P(B | Ai) P(Ai) / Σj P(B | Aj) P(Aj)

then it is clear that the same result would be obtained if all the prior probabilities P(Ai) and P(Aj) were multiplied by a given constant; the same would be true for a continuous random variable. If the summation in the denominator converges, the posterior probabilities will still sum (or integrate) to 1 even if the prior values do not, and so the priors may only need to be specified in the correct proportion. Taking this idea further, in many cases the sum or integral of the prior values may not even need to be finite to get sensible answers for the posterior probabilities. When this is the case, the prior is called an improper prior. However, the posterior distribution need not be a proper distribution if the prior is improper. This is clear from the case where event B is independent of all of the Aj.
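The cancellation of the constant can be checked directly: scaling every prior value by the same factor leaves the posterior untouched, because the factor also multiplies the denominator. A minimal sketch with illustrative numbers:

```python
def posterior(priors, likelihoods):
    """Posterior over hypotheses; priors need only be in correct proportion."""
    unnorm = [p * l for p, l in zip(priors, likelihoods)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

priors = [0.2, 0.3, 0.5]
likes = [0.9, 0.4, 0.1]
scaled = [7.0 * p for p in priors]   # "improper": sums to 7, not 1

same = all(abs(a - b) < 1e-12
           for a, b in zip(posterior(priors, likes), posterior(scaled, likes)))
```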

Statisticians sometimes use improper priors as uninformative priors.[6] For example, if they need a prior distribution for the mean and variance of a random variable, they may assume p(m, v) ∝ 1/v (for v > 0), which would suggest that any value for the mean is "equally likely" and that a value for the positive variance becomes "less likely" in inverse proportion to its value. Many authors (Lindley, 1973; De Groot, 1937; Kass and Wasserman, 1996) warn against the danger of over-interpreting those priors since they are not probability densities. The only relevance they have is found in the corresponding posterior, as long as it is well-defined for all observations. (The Haldane prior is a typical counterexample.)

By contrast, likelihood functions do not need to be integrated, and a likelihood function that is uniformly 1 corresponds to the absence of data (all models are equally likely, given no data): Bayes' rule multiplies a prior by the likelihood, and an empty product is just the constant likelihood 1. However, without starting with a prior probability distribution, one does not end up getting a posterior probability distribution, and thus cannot integrate or compute expected values or loss. See Likelihood function § Non-integrability for details.


Examples of improper priors include:

  • The uniform distribution on an infinite interval (i.e., a half-line or the entire real line).
  • Beta(0, 0), the Haldane prior on a proportion.
  • The logarithmic prior on the positive reals, p(x) ∝ 1/x (the uniform distribution on a logarithmic scale).

Note that these functions, interpreted as uniform distributions, can also be interpreted as the likelihood function in the absence of data, but are not proper priors.


  1. ^ Carlin, Bradley P.; Louis, Thomas A. (2008). Bayesian Methods for Data Analysis (Third ed.). CRC Press. ISBN 9781584886983.
  2. ^ This prior was proposed by J.B.S. Haldane in "A note on inverse probability", Mathematical Proceedings of the Cambridge Philosophical Society 28, 55–61, 1932, doi:10.1017/S0305004100010495. See also J. Haldane, "The precision of observed values of small frequencies", Biometrika, 35:297–300, 1948, doi:10.2307/2332350, JSTOR 2332350.
  3. ^ "Incorporation of Biological Pathway Knowledge in the Construction of Priors for Optimal Bayesian Classification - IEEE Journals & Magazine". Retrieved 2018-08-05.
  4. ^ Boluki, Shahin; Esfahani, Mohammad Shahrokh; Qian, Xiaoning; Dougherty, Edward R (December 2017). "Incorporating biological prior knowledge for Bayesian learning via maximal knowledge-driven information priors". BMC Bioinformatics. 18 (S14). doi:10.1186/s12859-017-1893-4. ISSN 1471-2105. PMC 5751802. PMID 29297278.
  5. ^ Jaynes (1968), pp. 17, see also Jaynes (2003), chapter 12. Note that chapter 12 is not available in the online preprint but can be previewed via Google Books.
  6. ^ Christensen, Ronald; Johnson, Wesley; Branscum, Adam; Hanson, Timothy E. (2010). Bayesian Ideas and Data Analysis : An Introduction for Scientists and Statisticians. Hoboken: CRC Press. p. 69. ISBN 9781439894798.


This page was last edited on 4 May 2019, at 14:04
Basis of this page is in Wikipedia. Text is available under the CC BY-SA 3.0 Unported License. Non-text media are available under their specified licenses. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc. WIKI 2 is an independent company and has no affiliation with Wikimedia Foundation.