Parameters  number of categories (integer) event probabilities 

Support  
pmf  (1) (2)

Mode 
In probability theory and statistics, a categorical distribution (also called a generalized Bernoulli distribution, multinoulli distribution^{[1]}) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category separately specified. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution, (e.g. 1 to K). The Kdimensional categorical distribution is the most general distribution over a Kway event; any other discrete distribution over a sizeK sample space is a special case. The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.
The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a die. On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.
YouTube Encyclopedic

1/5Views:1 046 9392 05595 09844 02635 221

✪ Discrete and continuous random variables  Probability and Statistics  Khan Academy

✪ Marginal and conditional distributions  Analyzing categorical data  AP Statistics  Khan Academy

✪ Discrete vs. Continuous Relationships Tutorial

✪ How to Generate Images  Intro to Deep Learning #14

✪ Excel 2013 Statistical Analysis #5 Data Categorical, Quantitative, Nominal, Ordinal, Interval, Ratio
Transcription
We already know a little bit about random variables. What we're going to see in this video is that random variables come in two varieties. You have discrete random variables, and you have continuous random variables. And discrete random variables, these are essentially random variables that can take on distinct or separate values. And we'll give examples of that in a second. So that comes straight from the meaning of the word discrete in the English language distinct or separate values. While continuous and I guess just another definition for the word discrete in the English language would be polite, or not obnoxious, or kind of subtle. That is not what we're talking about. We are not talking about random variables that are polite. We're talking about ones that can take on distinct values. And continuous random variables, they can take on any value in a range. And that range could even be infinite. So any value in an interval. So with those two definitions out of the way, let's look at some actual random variable definitions. And I want to think together about whether you would classify them as discrete or continuous random variables. So let's say that I have a random variable capital X. And it is equal to well, this is one that we covered in the last video. It's 1 if my fair coin is heads. It's 0 if my fair coin is tails. So is this a discrete or a continuous random variable? Well, this random variable right over here can take on distinctive values. It can take on either a 1 or it could take on a 0. Another way to think about it is you can count the number of different values it can take on. This is the first value it can take on, this is the second value that it can take on. So this is clearly a discrete random variable. Let's think about another one. Let's define random variable Y as equal to the mass of a random animal selected at the New Orleans zoo, where I grew up, the Audubon Zoo. Y is the mass of a random animal selected at the New Orleans zoo. Is this a discreet random variable or a continuous random variable? Well, the exact mass and I should probably put that qualifier here. I'll even add it here just to make it really, really clear. The exact mass of a random animal, or a random object in our universe, it can take on any of a whole set of values. I mean, who knows exactly the exact number of electrons that are part of that object right at that moment? Who knows the neutrons, the protons, the exact number of molecules in that object, or a part of that animal exactly at that moment? So that mass, for example, at the zoo, it might take on a value anywhere between well, maybe close to 0. There's no animal that has 0 mass. But it could be close to zero, if we're thinking about an ant, or we're thinking about a dust mite, or maybe if you consider even a bacterium an animal. I believe bacterium is the singular of bacteria. And it could go all the way. Maybe the most massive animal in the zoo is the elephant of some kind. And I don't know what it would be in kilograms, but it would be fairly large. So maybe you can get up all the way to 3,000 kilograms, or probably larger. Let's say 5,000 kilograms. I don't know what the mass of a very heavy elephant or a very massive elephant, I should say actually is. It may be something fun for you to look at. But any animal could have a mass anywhere in between here. It does not take on discrete values. You could have an animal that is exactly maybe 123.75921 kilograms. And even there, that actually might not be the exact mass. You might have to get even more precise, 10732. 0, 7, And I think you get the picture. Even though this is the way I've defined it now, a finite interval, you can take on any value in between here. They are not discrete values. So this one is clearly a continuous random variable. Let's think about another one. Let's think about let's say that random variable Y, instead of it being this, let's say it's the year that a random student in the class was born. Is this a discrete or a continuous random variable? Well, that year, you literally can define it as a specific discrete year. It could be 1992, or it could be 1985, or it could be 2001. There are discreet values that this random variable can actually take on. It won't be able to take on any value between, say, 2000 and 2001. It'll either be 2000 or it'll be 2001 or 2002. Once again, you can count the values it can take on. Most of the times that you're dealing with, as in the case right here, a discrete random variable let me make it clear this one over here is also a discreet random variable. Most of the time that you're dealing with a discrete random variable, you're probably going to be dealing with a finite number of values. But it does not have to be a finite number of values. You can actually have an infinite potential number of values that it could take on as long as the values are countable. As long as you can literally say, OK, this is the first value it could take on, the second, the third. And you might be counting forever, but as long as you can literally list and it could be even an infinite list. But if you can list the values that it could take on, then you're dealing with a discrete random variable. Notice in this scenario with the zoo, you could not list all of the possible masses. You could not even count them. You might attempt to and it's a fun exercise to try at least once, to try to list all of the values this might take on. You might say, OK, maybe it could take on 0.01 and maybe 0.02. But wait, you just skipped an infinite number of values that it could take on, because it could have taken on 0.011, 0.012. And even between those, there's an infinite number of values it could take on. There's no way for you to count the number of values that a continuous random variable can take on. There's no way for you to list them. With a discrete random variable, you can count the values. You can list the values. Let's do another example. Let's let random variable Z, capital Z, be the number ants born tomorrow in the universe. Now, you're probably arguing that there aren't ants on other planets. Or maybe there are antlike creatures, but they're not going to be ants as we define them. But how do we know? So number of ants born in the universe. Maybe some ants have figured out interstellar travel of some kind. So the number of ants born tomorrow in the universe. That's my random variable Z. Is this a discrete random variable or a continuous random variable? Well, once again, we can count the number of values this could take on. This could be 1. It could be 2. It could be 3. It could be 4. It could be 5 quadrillion ants. It could be 5 quadrillion and 1. We can actually count the values. Those values are discrete. So once again, this right over here is a discrete random variable. This is fun, so let's keep doing more of these. Let's say that I have random variable X. So we're not using this definition anymore. Now I'm going to define random variable X to be the winning time now let me write it this way. The exact winning time for the men's 100meter dash at the 2016 Olympics. So the exact time that it took for the winner who's probably going to be Usain Bolt, but it might not be. Actually, he's aging a little bit. But whatever the exact winning time for the men's 100meter in the 2016 Olympics. And not the one that you necessarily see on the clock. The exact, the precise time that you would see at the men's 100meter dash. Is this a discrete or a continuous random variable? Well, the way I've defined, and this one's a little bit tricky. Because you might say it's countable. You might say, well, it could either be 956, 9.56 seconds, or 9.57 seconds, or 9.58 seconds. And you might be tempted to believe that, because when you watch the 100meter dash at the Olympics, they measure it to the nearest hundredths. They round to the nearest hundredth. That's how precise their timing is. But I'm talking about the exact winning time, the exact number of seconds it takes for that person to, from the starting gun, to cross the finish line. And there, it can take on any value. It can take on any value between well, I guess they're limited by the speed of light. But it could take on any value you could imagine. It might be anywhere between 5 seconds and maybe 12 seconds. And it could be anywhere in between there. It might not be 9.57. That might be what the clock says, but in reality the exact winning time could be 9.571, or it could be 9.572359. I think you see what I'm saying. The exact precise time could be any value in an interval. So this right over here is a continuous random variable. Now what would be the case, instead of saying the exact winning time, if instead I defined X to be the winning time of the men's 100 meter dash at the 2016 Olympics rounded to the nearest hundredth? Is this a discrete or a continuous random variable? So let me delete this. I've changed the random variable now. Is this going to be a discrete or a continuous random variable? Well now, we can actually count the actual values that this random variable can take on. It might be 9.56. It could be 9.57. It could be 9.58. We can actually list them. So in this case, when we round it to the nearest hundredth, we can actually list of values. We are now dealing with a discrete random variable. Anyway, I'll let you go there. Hopefully this gives you a sense of the distinction between discrete and continuous random variables.
Contents
Terminology
Occasionally, the categorical distribution is termed the "discrete distribution". However, this properly refers not to one particular family of distributions but to a general class of distributions.
In some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise.^{[2]} This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1ofK" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below).
However, conflating the categorical and multinomial distributions can lead to problems. For example, in a Dirichletmultinomial distribution, which arises commonly in natural language processing models (although not usually with this name) as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a hierarchical Bayesian model, it is very important to distinguish categorical from multinomial. The joint distribution of the same variables with the same Dirichletmultinomial distribution has two different forms depending on whether it is characterized as a distribution whose domain is over individual categorical nodes or over multinomialstyle counts of nodes in each particular category (similar to the distinction between a set of Bernoullidistributed nodes and a single binomialdistributed node). Both forms have very similarlooking probability mass functions (PMFs), which both make reference to multinomialstyle counts of nodes in a category. However, the multinomialstyle PMF has an extra factor, a multinomial coefficient, that is a constant equal to 1 in the categoricalstyle PMF. Confusing the two can easily lead to incorrect results in settings where this extra factor is not constant with respect to the distributions of interest. The factor is frequently constant in the complete conditionals used in Gibbs sampling and the optimal distributions in variational methods.
Formulating distributions
A categorical distribution is a discrete probability distribution whose sample space is the set of k individually identified items. It is the generalization of the Bernoulli distribution for a categorical random variable.
In one formulation of the distribution, the sample space is taken to be a finite sequence of integers. The exact integers used as labels are unimportant; they might be {0, 1, ..., k − 1} or {1, 2, ..., k} or any other arbitrary set of values. In the following descriptions, we use {1, 2, ..., k} for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses {0, 1}. In this case, the probability mass function f is:
where , represents the probability of seeing element i and .
Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket:^{[3]}
where evaluates to 1 if , 0 otherwise. There are various advantages of this formulation, e.g.:
 It is easier to write out the likelihood function of a set of independent identically distributed categorical variables.
 It connects the categorical distribution with the related multinomial distribution.
 It shows why the Dirichlet distribution is the conjugate prior of the categorical distribution, and allows the posterior distribution of the parameters to be calculated.
Yet another formulation makes explicit the connection between the categorical and multinomial distributions by treating the categorical distribution as a special case of the multinomial distribution in which the parameter n of the multinomial distribution (the number of sampled items) is fixed at 1. In this formulation, the sample space can be considered to be the set of 1ofK encoded^{[4]} random vectors x of dimension k having the property that exactly one element has the value 1 and the others have the value 0. The particular element having the value 1 indicates which category has been chosen. The probability mass function f in this formulation is:
where represents the probability of seeing element i and . This is the formulation adopted by Bishop.^{[4]}^{[note 1]}
Properties
 The distribution is completely given by the probabilities associated with each number i: , i = 1,...,k, where . The possible sets of probabilities are exactly those in the standard dimensional simplex; for k = 2 this reduces to the possible probabilities of the Bernoulli distribution being the 1simplex,
 The distribution is a special case of a "multivariate Bernoulli distribution"^{[5]} in which exactly one of the k 01 variables takes the value one.
 Let be the realisation from a categorical distribution. Define the random vector Y as composed of the elements:
 where I is the indicator function. Then Y has a distribution which is a special case of the multinomial distribution with parameter . The sum of independent and identically distributed such random variables Y constructed from a categorical distribution with parameter is multinomially distributed with parameters and
 The conjugate prior distribution of a categorical distribution is a Dirichlet distribution.^{[2]} See the section below for more discussion.
 The sufficient statistic from n independent observations is the set of counts (or, equivalently, proportion) of observations in each category, where the total number of trials (=n) is fixed.
 The indicator function of an observation having a value i, equivalent to the Iverson bracket function or the Kronecker delta function is Bernoulli distributed with parameter
Bayesian inference using conjugate prior
In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution (and also the multinomial distribution). This means that in a model consisting of a data point having a categorical distribution with unknown parameter vector p, and (in standard Bayesian style) we choose to treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, then the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet. Intuitively, in such a case, starting from what is known about the parameter prior to observing the data point, knowledge can then be updated based on the data point, yielding a new distribution of the same form as the old one. As such, knowledge of a parameter can be successively updated by incorporating new observations one at a time, without running into mathematical difficulties.
Formally, this can be expressed as follows. Given a model
then the following holds:^{[2]}
This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples. Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen. Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.
Further intuition comes from the expected value of the posterior distribution (see the article on the Dirichlet distribution):
This says that the expected probability of seeing a category i among the various discrete distributions generated by the posterior distribution is simply equal to the proportion of occurrences of that category actually seen in the data, including the pseudocounts in the prior distribution. This makes a great deal of intuitive sense: if, for example, there are three possible categories, and category 1 is seen in the observed data 40% of the time, one would expect on average to see category 1 40% of the time in the posterior distribution as well.
(This intuition is ignoring the effect of the prior distribution. Furthermore, the posterior is a distribution over distributions. The posterior distribution in general describes the parameter in question, and in this case the parameter itself is a discrete probability distribution, i.e. the actual categorical distribution that generated the data. For example, if 3 categories in the ratio 40:5:55 are in the observed data, then ignoring the effect of the prior distribution, the true parameter – i.e. the true, underlying distribution that generated our observed data – would be expected to have the average value of (0.40,0.05,0.55), which is indeed what the posterior reveals. However, the true distribution might actually be (0.35,0.07,0.58) or (0.42,0.04,0.54) or various other nearby possibilities. The amount of uncertainty involved here is specified by the variance of the posterior, which is controlled by the total number of observations – the more data observed, the less uncertainty about the true parameter.)
(Technically, the prior parameter should actually be seen as representing prior observations of category . Then, the updated posterior parameter represents posterior observations. This reflects the fact that a Dirichlet distribution with has a completely flat shape — essentially, a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the term and simply think of the α vector as directly representing a set of pseudocounts. Furthermore, doing this avoids the issue of interpreting values less than 1.)
MAP estimation
The maximumaposteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e.,^{[2]}
In many practical applications, the only way to guarantee the condition that is to set for all i.
Marginal likelihood
In the above model, the marginal likelihood of the observations (i.e. the joint distribution of the observations, with the prior parameter marginalized out) is a Dirichletmultinomial distribution:^{[2]}
This distribution plays an important role in hierarchical Bayesian models, because when doing inference over such models using methods such as Gibbs sampling or variational Bayes, Dirichlet prior distributions are often marginalized out. See the article on this distribution for more details.
Posterior predictive distribution
The posterior predictive distribution of a new observation in the above model is the distribution that a new observation would take given the set of N categorical observations. As shown in the Dirichletmultinomial distribution article, it has a very simple form:^{[2]}
There are various relationships among this formula and the previous ones:
 The posterior predictive probability of seeing a particular category is the same as the relative proportion of previous observations in that category (including the pseudoobservations of the prior). This makes logical sense — intuitively, we would expect to see a particular category according to the frequency already observed of that category.
 The posterior predictive probability is the same as the expected value of the posterior distribution. This is explained more below.
 As a result, this formula can be expressed as simply "the posterior predictive probability of seeing a category is proportional to the total observed count of that category", or as "the expected count of a category is the same as the total observed count of the category", where "observed count" is taken to include the pseudoobservations of the prior.
The reason for the equivalence between posterior predictive probability and the expected value of the posterior distribution of p is evident with reexamination of the above formula. As explained in the posterior predictive distribution article, the formula for the posterior predictive probability has the form of an expected value taken with respect to the posterior distribution:
The crucial line above is the third. The second follows directly from the definition of expected value. The third line is particular to the categorical distribution, and follows from the fact that, in the categorical distribution specifically, the expected value of seeing a particular value i is directly specified by the associated parameter p_{i}. The fourth line is simply a rewriting of the third in a different notation, using the notation farther up for an expectation taken with respect to the posterior distribution of the parameters.
Observe data points one by one and each time consider their predictive probability before observing the data point and updating the posterior. For any given data point, the probability of that point assuming a given category depends on the number of data points already in that category. In this scenario, if a category has a high frequency of occurrence, then new data points are more likely to join that category — further enriching the same category. This type of scenario is often termed a preferential attachment (or "rich get richer") model. This models many realworld processes, and in such cases the choices made by the first few data points have an outsize influence on the rest of the data points.
Posterior conditional distribution
In Gibbs sampling, one typically needs to draw from conditional distributions in multivariable Bayes networks where each variable is conditioned on all the others. In networks that include categorical variables with Dirichlet priors (e.g. mixture models and models including mixture components), the Dirichlet distributions are often "collapsed out" (marginalized out) of the network, which introduces dependencies among the various categorical nodes dependent on a given prior (specifically, their joint distribution is a Dirichletmultinomial distribution). One of the reasons for doing this is that in such a case, the distribution of one categorical node given the others is exactly the posterior predictive distribution of the remaining nodes.
That is, for a set of nodes , if the node in question is denoted as and the remainder as , then
where is the number of nodes having category i among the nodes other than node n.
Sampling
There are a number of methods, but the most common way to sample from a categorical distribution uses a type of inverse transform sampling:
Assume a distribution is expressed as "proportional to" some expression, with unknown normalizing constant. Before taking any samples, one prepares some values as follows:
 Compute the unnormalized value of the distribution for each category.
 Sum them up and divide each value by this sum, in order to normalize them.
 Impose some sort of order on the categories (e.g. by an index that runs from 1 to k, where k is the number of categories).
 Convert the values to a cumulative distribution function (CDF) by replacing each value with the sum of all of the previous values. This can be done in time O(k). The resulting value for the first category will be 0.
Then, each time it is necessary to sample a value:
 Pick a uniformly distributed number between 0 and 1.
 Locate the greatest number in the CDF whose value is less than or equal to the number just chosen. This can be done in time O(log(k)), by binary search.
 Return the category corresponding to this CDF value.
If it is necessary to draw many values from the same categorical distribution, the following approach is more efficient. It draws n samples in O(n) time (assuming an O(1) approximation is used to draw values from the binomial distribution^{[6]}).
function draw_categorical(n) // where n is the number of samples to draw from the categorical distribution r = 1 s = 0 for i from 1 to k // where k is the number of categories v = draw from a binomial(n, p[i] / r) distribution // where p[i] is the probability of category i for j from 1 to v z[s++] = i // where z is an array in which the results are stored n = n  v r = r  p[i] shuffle (randomly reorder) the elements in z return z
Sampling via the Gumbel distribution
In machine learning it is typical to parametrize the categorical distribution, via an unconstrained representation in , whose components are given by:
where is any real constant. Given this representation, can be recovered using the softmax function, which can then be sampled using the techniques described above. There is however a more direct sampling method that uses samples from the Gumbel distribution.^{[7]} Let be k independent draws from the standard Gumbel distribution, then
will be a sample from the desired categorical distribution. (If is a sample from the standard uniform distribution, then is a sample from the standard Gumbel distribution.)
See also
Related distributions
 Dirichlet distribution
 Multinomial distribution
 Bernoulli distribution
 Dirichletmultinomial distribution
Notes
 ^ However, Bishop does not explicitly use the term categorical distribution.
References
 ^ Murphy, K. P. (2012). Machine learning: a probabilistic perspective, p. 35. MIT press. ISBN 0262018020.
 ^ ^{a} ^{b} ^{c} ^{d} ^{e} ^{f} Minka, T. (2003) Bayesian inference, entropy and the multinomial distribution. Technical report Microsoft Research.
 ^ Minka, T. (2003), op. cit. Minka uses the Kronecker delta function, similar to but less general than the Iverson bracket.
 ^ ^{a} ^{b} Bishop, C. (2006) Pattern Recognition and Machine Learning, Springer. ISBN 0387310738.
 ^ Johnson, N.L., Kotz, S., Balakrishnan, N. (1997) Discrete Multivariate Distributions, Wiley. ISBN 0471128449 (p. 105)
 ^ Agresti, A., An Introduction to Categorical Data Analysis, WileyInterscience, 2007, ISBN 9780471226185, pp. 25
 ^ Adams, Ryan. "The Gumbel–Max Trick for Discrete Distributions".