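(Figures: probability density function; cumulative distribution function)
Notation: Beta(α, β)
Parameters: α > 0 shape (real); β > 0 shape (real)
Support: x ∈ [0, 1] or x ∈ (0, 1)
PDF: $\frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha,\beta)}$, where $\mathrm{B}(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}$ and Γ is the Gamma function.
CDF: $I_x(\alpha,\beta)$ (the regularised incomplete beta function)
Mean: $\operatorname{E}[X] = \frac{\alpha}{\alpha+\beta}$; $\operatorname{E}[\ln X] = \psi(\alpha) - \psi(\alpha+\beta)$ (see digamma function and see section: Geometric mean)
Median: $I^{[-1]}_{1/2}(\alpha,\beta)$ in general; $\approx \frac{\alpha - 1/3}{\alpha+\beta-2/3}$ for α, β > 1
Mode: $\frac{\alpha-1}{\alpha+\beta-2}$ for α, β > 1; any value in (0, 1) for α, β = 1; {0, 1} (bimodal) for α, β < 1; 0 for α ≤ 1, β > 1; 1 for α > 1, β ≤ 1
Variance: $\operatorname{var}[X] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$; $\operatorname{var}[\ln X] = \psi_1(\alpha) - \psi_1(\alpha+\beta)$ (see trigamma function and see section: Geometric variance)
Skewness: $\frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}}$
Ex. kurtosis: $\frac{6[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)}$
Entropy: $\ln\mathrm{B}(\alpha,\beta) - (\alpha-1)\psi(\alpha) - (\beta-1)\psi(\beta) + (\alpha+\beta-2)\psi(\alpha+\beta)$
MGF: $1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{t^k}{k!}$
CF: ${}_1F_1(\alpha;\alpha+\beta;it)$ (see Confluent hypergeometric function)
Fisher information: $\begin{bmatrix}\operatorname{var}[\ln X] & \operatorname{cov}[\ln X,\ln(1-X)]\\ \operatorname{cov}[\ln X,\ln(1-X)] & \operatorname{var}[\ln(1-X)]\end{bmatrix}$ (see section: Fisher information matrix)
Method of Moments: $\alpha = \left(\frac{\operatorname{E}[X](1-\operatorname{E}[X])}{\operatorname{V}[X]} - 1\right)\operatorname{E}[X]$; $\beta = \left(\frac{\operatorname{E}[X](1-\operatorname{E}[X])}{\operatorname{V}[X]} - 1\right)(1-\operatorname{E}[X])$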

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution. The generalization to multiple variables is called a Dirichlet distribution.
The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines.
In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli, binomial, negative binomial and geometric distributions. The beta distribution is a suitable model for the random behavior of percentages and proportions.
The formulation of the beta distribution discussed here is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution.
Definitions
Probability density function
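The probability density function (pdf) of the beta distribution, for 0 ≤ x ≤ 1, and shape parameters α, β > 0, is a power function of the variable x and of its reflection (1 − x) as follows:
$$ f(x;\alpha,\beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\int_0^1 u^{\alpha-1}(1-u)^{\beta-1}\,du} = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} = \frac{1}{\mathrm{B}(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1} $$
where Γ(z) is the gamma function. The beta function, B(α, β), is a normalization constant to ensure that the total probability is 1. In the above equations x is a realization (an observed value that actually occurred) of a random process X.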
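The density above can be evaluated directly from the gamma function. The following is a minimal sketch using only the Python standard library; the function names `beta_function` and `beta_pdf` are illustrative, and the open interval 0 < x < 1 is assumed so that the endpoint singularities for α, β < 1 are avoided.

```python
import math

def beta_function(alpha: float, beta: float) -> float:
    """B(alpha, beta) = Gamma(alpha) Gamma(beta) / Gamma(alpha + beta), via log-gamma."""
    return math.exp(math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta))

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    """Density of Beta(alpha, beta) at x; returns 0 outside the open interval (0, 1)."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("shape parameters must be positive")
    if not 0.0 < x < 1.0:
        return 0.0
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / beta_function(alpha, beta)
```

For example, `beta_pdf(0.5, 2, 2)` evaluates the Beta(2, 2) density 6x(1 − x) at its peak, and `beta_pdf(x, 1, 1)` is the uniform density.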
This definition includes both ends x = 0 and x = 1, which is consistent with definitions for other continuous distributions supported on a bounded interval which are special cases of the beta distribution, for example the arcsine distribution, and consistent with several authors, like N. L. Johnson and S. Kotz.^{[1]}^{[2]}^{[3]}^{[4]} However, the inclusion of x = 0 and x = 1 does not work for α, β < 1; accordingly, several other authors, including W. Feller,^{[5]}^{[6]}^{[7]} choose to exclude the ends x = 0 and x = 1, (so that the two ends are not actually part of the domain of the density function) and consider instead 0 < x < 1.
Several authors, including N. L. Johnson and S. Kotz,^{[1]} use the symbols p and q (instead of α and β) for the shape parameters of the beta distribution, reminiscent of the symbols traditionally used for the parameters of the Bernoulli distribution, because the beta distribution approaches the Bernoulli distribution in the limit when both shape parameters α and β approach the value of zero.
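In the following, a random variable X beta-distributed with parameters α and β will be denoted by:^{[8]}^{[9]}
$$ X \sim \operatorname{Beta}(\alpha,\beta) $$
Other notations for beta-distributed random variables used in the statistical literature are $X \sim \mathrm{Be}(\alpha,\beta)$^{[10]} and $X \sim \beta_{\alpha,\beta}$.^{[5]}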
Cumulative distribution function
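The cumulative distribution function is
$$ F(x;\alpha,\beta) = \frac{\mathrm{B}(x;\alpha,\beta)}{\mathrm{B}(\alpha,\beta)} = I_x(\alpha,\beta) $$
where $\mathrm{B}(x;\alpha,\beta) = \int_0^x u^{\alpha-1}(1-u)^{\beta-1}\,du$ is the incomplete beta function and $I_x(\alpha,\beta)$ is the regularized incomplete beta function.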
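The regularized incomplete beta function has no elementary closed form in general, so it is usually evaluated numerically. The sketch below approximates it with a plain midpoint rule, which is an assumption made here for brevity (it is not a production algorithm) and is only reliable for α, β ≥ 1, where the integrand is bounded. The name `beta_cdf` and the `steps` parameter are illustrative.

```python
import math

def beta_cdf(x: float, alpha: float, beta: float, steps: int = 20000) -> float:
    """Approximate I_x(alpha, beta) by midpoint integration of the density on [0, x].

    Assumes alpha, beta >= 1 so the integrand has no endpoint singularity.
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    # log of 1 / B(alpha, beta)
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    h = x / steps
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * h  # midpoint of the i-th subinterval
        total += u ** (alpha - 1) * (1 - u) ** (beta - 1)
    return math.exp(log_norm) * total * h
```

For Beta(3, 2) the exact CDF is 4x³ − 3x⁴, which gives a convenient check of the approximation.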
Alternative parametrizations
Two parameters
Mean and sample size
The beta distribution may also be reparameterized in terms of its mean μ (0 < μ < 1) and the sum of both shape parameters ν = α + β > 0 (^{[9]} p. 83). Denoting by α_{Posterior} and β_{Posterior} the shape parameters of the posterior beta distribution resulting from applying Bayes' theorem to a binomial likelihood function and a prior probability, the interpretation of the sum of both shape parameters as sample size = ν = α_{Posterior} + β_{Posterior} is only correct for the Haldane prior probability Beta(0,0). Specifically, for the Bayes (uniform) prior Beta(1,1) the correct interpretation would be sample size = α_{Posterior} + β_{Posterior} − 2, or ν = (sample size) + 2. For sample sizes much larger than 2, the difference between these two priors becomes negligible. (See section Bayesian inference for further details.) In the rest of this article ν = α + β will be referred to as "sample size", but one should remember that it is, strictly speaking, the "sample size" of a binomial likelihood function only when using a Haldane Beta(0,0) prior in Bayes' theorem.
This parametrization may be useful in Bayesian parameter estimation. For example, one may administer a test to a number of individuals. If it is assumed that each person's score (0 ≤ θ ≤ 1) is drawn from a population-level beta distribution, then an important statistic is the mean of this population-level distribution. The mean and sample size parameters are related to the shape parameters α and β via^{[9]}
 α = μν, β = (1 − μ)ν
Under this parametrization, one may place an uninformative prior probability over the mean, and a vague prior probability (such as an exponential or gamma distribution) over the positive reals for the sample size, if they are independent, and prior data and/or beliefs justify it.
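The conversion between the (μ, ν) parametrization and the shape parameters is a one-line computation in each direction. A minimal sketch follows; the function names are illustrative.

```python
def shapes_from_mean_size(mu: float, nu: float) -> tuple[float, float]:
    """Return (alpha, beta) for mean mu in (0, 1) and 'sample size' nu = alpha + beta > 0."""
    if not 0.0 < mu < 1.0 or nu <= 0.0:
        raise ValueError("require 0 < mu < 1 and nu > 0")
    return mu * nu, (1.0 - mu) * nu

def mean_size_from_shapes(alpha: float, beta: float) -> tuple[float, float]:
    """Return (mu, nu) for shape parameters alpha, beta > 0."""
    if alpha <= 0.0 or beta <= 0.0:
        raise ValueError("shape parameters must be positive")
    nu = alpha + beta
    return alpha / nu, nu
```

For instance, μ = 0.25 with ν = 8 corresponds to Beta(2, 6).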
Mode and concentration
The mode and "concentration" can also be used to calculate the parameters for a beta distribution.^{[11]}
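The text does not give the formulas for this parametrization. One common convention, assumed here, takes the mode ω = (α − 1)/(α + β − 2) and the concentration κ = α + β with κ > 2 (so that the mode is interior), which inverts to α = ω(κ − 2) + 1 and β = (1 − ω)(κ − 2) + 1. The sketch below implements that convention; the function name is illustrative.

```python
def shapes_from_mode_concentration(omega: float, kappa: float) -> tuple[float, float]:
    """Return (alpha, beta) from mode omega and concentration kappa = alpha + beta.

    Assumes the convention kappa > 2, for which alpha, beta > 1 and the mode is interior.
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("mode must lie in [0, 1]")
    if kappa <= 2.0:
        raise ValueError("this convention requires kappa > 2")
    alpha = omega * (kappa - 2.0) + 1.0
    beta = (1.0 - omega) * (kappa - 2.0) + 1.0
    return alpha, beta
```

With ω = 0.25 and κ = 6 this returns Beta(2, 4), whose mode (α − 1)/(α + β − 2) is indeed 1/4.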
Mean (allele frequency) and (Wright's) genetic distance between two populations
The Balding–Nichols model^{[12]} is a two-parameter parametrization of the beta distribution used in population genetics. It is a statistical description of the allele frequencies in the components of a subdivided population:
$$ \alpha = \mu\nu, \qquad \beta = (1-\mu)\nu $$
where $\nu = \alpha+\beta = \frac{1-F}{F}$ and 0 < F < 1; here F is (Wright's) genetic distance between two populations.
See the articles Balding–Nichols model, F-statistics, fixation index and coefficient of relationship, for further information.
Mean and variance
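Solving the system of (coupled) equations given in the sections on the mean and the variance of the beta distribution in terms of the original parameters α and β, one can express the α and β parameters in terms of the mean (μ) and the variance (var):
$$ \nu = \alpha+\beta = \frac{\mu(1-\mu)}{\mathrm{var}} - 1, \quad\text{where } \nu > 0, \text{ therefore } \mathrm{var} < \mu(1-\mu) $$
$$ \alpha = \mu\nu = \mu\left(\frac{\mu(1-\mu)}{\mathrm{var}} - 1\right), \qquad \beta = (1-\mu)\nu = (1-\mu)\left(\frac{\mu(1-\mu)}{\mathrm{var}} - 1\right), \quad\text{if } \mathrm{var} < \mu(1-\mu) $$
This parametrization of the beta distribution may lead to a more intuitive understanding than the one based on the original parameters α and β. For example, the mode, skewness, excess kurtosis and differential entropy can each be expressed in terms of the mean and the variance.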
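Recovering the shape parameters from a mean and a variance can be sketched as follows. The only relation used is ν = μ(1 − μ)/var − 1 with α = μν and β = (1 − μ)ν; the function name is illustrative, and the constraint var < μ(1 − μ) is checked explicitly because no beta distribution exists otherwise.

```python
def shapes_from_mean_variance(mean: float, variance: float) -> tuple[float, float]:
    """Return (alpha, beta) of the beta distribution with the given mean and variance."""
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if not 0.0 < variance < mean * (1.0 - mean):
        raise ValueError("variance must satisfy 0 < var < mean * (1 - mean)")
    nu = mean * (1.0 - mean) / variance - 1.0  # nu = alpha + beta
    return mean * nu, (1.0 - mean) * nu
```

For example, a mean of 0.4 and a variance of 0.04 give approximately (2, 3), up to floating-point rounding.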
Four parameters
A beta distribution with the two shape parameters α and β is supported on the range [0,1] or (0,1). It is possible to alter the location and scale of the distribution by introducing two further parameters representing the minimum, a, and maximum, c (c > a), values of the distribution,^{[1]} by a linear transformation substituting the non-dimensional variable x in terms of the new variable y (with support [a,c] or (a,c)) and the parameters a and c:
$$ y = x(c-a) + a, \quad\text{therefore}\quad x = \frac{y-a}{c-a} $$
The probability density function of the four-parameter beta distribution is equal to the two-parameter density divided by the range (c − a) (so that the total area under the density curve equals a probability of one), with the variable y shifted and scaled as follows:
$$ f(y;\alpha,\beta,a,c) = \frac{f(x;\alpha,\beta)}{c-a} = \frac{(y-a)^{\alpha-1}(c-y)^{\beta-1}}{(c-a)^{\alpha+\beta-1}\,\mathrm{B}(\alpha,\beta)} $$
That a random variable Y is beta-distributed with four parameters α, β, a, and c will be denoted by:
$$ Y \sim \operatorname{Beta}(\alpha,\beta,a,c) $$
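The four-parameter density is the standard density evaluated at x = (y − a)/(c − a) and divided by the range. A minimal sketch, with illustrative function names and the open interval a < y < c assumed:

```python
import math

def beta_pdf(x: float, alpha: float, beta: float) -> float:
    """Standard Beta(alpha, beta) density on the open interval (0, 1)."""
    if not 0.0 < x < 1.0:
        return 0.0
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    return math.exp(log_norm) * x ** (alpha - 1) * (1 - x) ** (beta - 1)

def beta4_pdf(y: float, alpha: float, beta: float, a: float, c: float) -> float:
    """Density of the four-parameter beta distribution supported on (a, c), with c > a."""
    if c <= a:
        raise ValueError("require c > a")
    x = (y - a) / (c - a)                    # map y back to the unit interval
    return beta_pdf(x, alpha, beta) / (c - a)  # divide by the range so the area stays 1
```

With α = β = 1 this reduces to the uniform density 1/(c − a) on (a, c).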
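The measures of central location are scaled (by (c − a)) and shifted (by a), as follows:
$$ \mu_Y = \mu_X(c-a) + a = \frac{\alpha}{\alpha+\beta}(c-a) + a = \frac{\alpha c + \beta a}{\alpha+\beta} $$
$$ \operatorname{mode}(Y) = \operatorname{mode}(X)(c-a) + a = \frac{\alpha-1}{\alpha+\beta-2}(c-a) + a = \frac{(\alpha-1)c + (\beta-1)a}{\alpha+\beta-2}, \quad\text{if } \alpha,\beta > 1 $$
$$ \operatorname{median}(Y) = \operatorname{median}(X)(c-a) + a = I^{[-1]}_{1/2}(\alpha,\beta)\,(c-a) + a $$
(The geometric mean and harmonic mean cannot be transformed by a linear transformation in the way that the mean, median and mode can.)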
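The shape parameters of Y can be written in terms of its mean $\mu_Y$ and variance $\sigma_Y^2$ as
$$ \alpha = \frac{\mu_Y - a}{c-a}\left(\frac{(\mu_Y-a)(c-\mu_Y)}{\sigma_Y^2} - 1\right), \qquad \beta = \frac{c-\mu_Y}{c-a}\left(\frac{(\mu_Y-a)(c-\mu_Y)}{\sigma_Y^2} - 1\right) $$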
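The statistical dispersion measures are scaled (they do not need to be shifted because they are already centered on the mean) by the range (c − a), linearly for the mean deviation and quadratically for the variance:
$$ \text{(mean deviation around mean)}(Y) = \text{(mean deviation around mean)}(X)\,(c-a) = \frac{2\alpha^{\alpha}\beta^{\beta}}{\mathrm{B}(\alpha,\beta)(\alpha+\beta)^{\alpha+\beta+1}}\,(c-a) $$
$$ \operatorname{var}(Y) = \operatorname{var}(X)\,(c-a)^2 = \frac{\alpha\beta\,(c-a)^2}{(\alpha+\beta)^2(\alpha+\beta+1)} $$
Since the skewness and excess kurtosis are non-dimensional quantities (as moments centered on the mean and normalized by the standard deviation), they are independent of the parameters a and c, and therefore equal to the expressions given for X (with support [0,1] or (0,1)):
$$ \operatorname{skewness}(Y) = \operatorname{skewness}(X) = \frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}} $$
$$ \text{kurtosis excess}(Y) = \text{kurtosis excess}(X) = \frac{6[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)} $$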
Properties
Measures of central tendency
Mode
The mode of a beta-distributed random variable X with α, β > 1 is the most likely value of the distribution (corresponding to the peak in the PDF), and is given by the following expression:^{[1]}
$$ \operatorname{mode}(X) = \frac{\alpha-1}{\alpha+\beta-2} $$
When both parameters are less than one (α, β < 1), this is the antimode: the lowest point of the probability density curve.^{[3]}
Letting α = β, the expression for the mode simplifies to 1/2, showing that for α = β > 1 the mode (resp. antimode when α, β < 1) is at the center of the distribution: it is symmetric in those cases. See the Shapes section in this article for a full list of mode cases, for arbitrary values of α and β. For several of these cases, the maximum value of the density function occurs at one or both ends. In some cases the (maximum) value of the density function occurring at the end is finite. For example, in the case of α = 2, β = 1 (or α = 1, β = 2), the density function becomes a right-triangle distribution which is finite at both ends. In several other cases there is a singularity at one end, where the value of the density function approaches infinity. For example, in the case α = β = 1/2, the beta distribution simplifies to become the arcsine distribution. There is debate among mathematicians about some of these cases and whether the ends (x = 0, and x = 1) can be called modes or not,^{[6]}^{[8]} in particular:
 Whether the ends are part of the domain of the density function
 Whether a singularity can ever be called a mode
 Whether cases with two maxima should be called bimodal
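The interior stationary point of the density can be computed with the same expression in both the mode and the antimode case. The sketch below covers only those two cases (α, β > 1 and α, β < 1) and deliberately returns `None` otherwise, since the treatment of the endpoints is exactly what the points above leave open. The function name and the returned labels are illustrative.

```python
def interior_extremum(alpha: float, beta: float):
    """Return (location, kind) of the interior extremum of the Beta(alpha, beta) density.

    kind is "mode" when alpha, beta > 1 and "antimode" when alpha, beta < 1.
    Returns None in all other cases (extrema at the ends, or the uniform case).
    """
    if alpha <= 0 or beta <= 0:
        raise ValueError("shape parameters must be positive")
    if alpha > 1 and beta > 1:
        return (alpha - 1) / (alpha + beta - 2), "mode"
    if alpha < 1 and beta < 1:
        return (alpha - 1) / (alpha + beta - 2), "antimode"
    return None
```

For symmetric shapes such as Beta(0.5, 0.5) or Beta(3, 3) the location is 1/2 in both cases.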
Median
The median of the beta distribution is the unique real number $x = I^{[-1]}_{1/2}(\alpha,\beta)$ for which the regularized incomplete beta function $I_x(\alpha,\beta) = \tfrac{1}{2}$. There is no general closed-form expression for the median of the beta distribution for arbitrary values of α and β. Closed-form expressions for particular values of the parameters α and β follow:^{[citation needed]}
 For symmetric cases α = β, median = 1/2.
 For α = 1 and β > 0, median $= 1 - 2^{-1/\beta}$ (this case is the mirror-image of the power function [0,1] distribution)
 For α > 0 and β = 1, median $= 2^{-1/\alpha}$ (this case is the power function [0,1] distribution^{[6]})
 For α = 3 and β = 2, median = 0.6142724318676105..., the real solution to the quartic equation 1 − 8x^{3} + 6x^{4} = 0, which lies in [0,1].
 For α = 2 and β = 3, median = 0.38572756813238945... = 1−median(Beta(3, 2))
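The following are the limits with one parameter finite (nonzero) and the other approaching these limits:^{[citation needed]}
$$ \lim_{\beta\to 0}\operatorname{median} = \lim_{\alpha\to\infty}\operatorname{median} = 1, \qquad \lim_{\alpha\to 0}\operatorname{median} = \lim_{\beta\to\infty}\operatorname{median} = 0 $$
A reasonable approximation of the value of the median of the beta distribution, for both α and β greater than or equal to one, is given by the formula^{[13]}
$$ \operatorname{median} \approx \frac{\alpha - \tfrac{1}{3}}{\alpha+\beta-\tfrac{2}{3}} \quad\text{for } \alpha,\beta \ge 1 $$
When α, β ≥ 1, the relative error (the absolute error divided by the median) in this approximation is less than 4% and for both α ≥ 2 and β ≥ 2 it is less than 1%. The absolute error divided by the difference between the mean and the mode is similarly small.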
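Because the median has no general closed form, it is typically found by solving I_x(α, β) = 1/2 numerically. The sketch below does this by bisection on a midpoint-rule approximation of the CDF and compares the result with the approximation (α − 1/3)/(α + β − 2/3). The quadrature, the step count and the function names are assumptions made for illustration, and the code is intended for α, β ≥ 1 only.

```python
import math

def beta_cdf(x: float, alpha: float, beta: float, steps: int = 2000) -> float:
    """Midpoint-rule approximation of I_x(alpha, beta); assumes alpha, beta >= 1."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    h = x / steps
    total = sum(((i + 0.5) * h) ** (alpha - 1) * (1 - (i + 0.5) * h) ** (beta - 1)
                for i in range(steps))
    return math.exp(log_norm) * total * h

def beta_median(alpha: float, beta: float, tol: float = 1e-10) -> float:
    """Solve I_x(alpha, beta) = 1/2 for x by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, alpha, beta) < 0.5:
            lo = mid   # median lies to the right of mid
        else:
            hi = mid   # median lies at or to the left of mid
    return 0.5 * (lo + hi)

def median_approx(alpha: float, beta: float) -> float:
    """Closed-form approximation of the median, intended for alpha, beta >= 1."""
    return (alpha - 1 / 3) / (alpha + beta - 2 / 3)
```

For Beta(3, 2) the bisection reproduces the root of 1 − 8x³ + 6x⁴ = 0 quoted above, and the closed-form approximation differs from it by well under 1%.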
Mean
The expected value (mean) (μ) of a beta-distributed random variable X with two parameters α and β is a function of only the ratio β/α of these parameters:^{[1]}
$$ \mu = \operatorname{E}[X] = \int_0^1 x\,f(x;\alpha,\beta)\,dx = \frac{\alpha}{\alpha+\beta} = \frac{1}{1+\frac{\beta}{\alpha}} $$
Letting α = β in the above expression one obtains μ = 1/2, showing that for α = β the mean is at the center of the distribution: it is symmetric. Also, the following limits can be obtained from the above expression:
$$ \lim_{\beta/\alpha\to 0}\mu = 1, \qquad \lim_{\beta/\alpha\to\infty}\mu = 0 $$
Therefore, for β/α → 0, or for α/β → ∞, the mean is located at the right end, x = 1. For these limit ratios, the beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the right end, x = 1, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the right end, x = 1.
Similarly, for β/α → ∞, or for α/β → 0, the mean is located at the left end, x = 0. The beta distribution becomes a one-point degenerate distribution with a Dirac delta function spike at the left end, x = 0, with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the left end, x = 0. Following are the limits with one parameter finite (nonzero) and the other approaching these limits:
$$ \lim_{\beta\to 0}\mu = \lim_{\alpha\to\infty}\mu = 1, \qquad \lim_{\alpha\to 0}\mu = \lim_{\beta\to\infty}\mu = 0 $$
While for typical unimodal distributions (with centrally located modes, inflexion points at both sides of the mode, and longer tails), that is Beta(α, β) with α, β > 2, it is known that the sample mean (as an estimate of location) is not as robust as the sample median, the opposite is the case for uniform or "U-shaped" bimodal distributions (Beta(α, β) with α, β ≤ 1), with the modes located at the ends of the distribution. As Mosteller and Tukey remark (^{[14]} p. 207) "the average of the two extreme observations uses all the sample information. This illustrates how, for short-tailed distributions, the extreme observations should get more weight." By contrast, it follows that the median of "U-shaped" bimodal distributions with modes at the edge of the distribution (Beta(α, β) with α, β ≤ 1) is not robust, as the sample median drops the extreme sample observations from consideration. A practical application of this occurs for example for random walks, since the probability for the time of the last visit to the origin in a random walk is distributed as the arcsine distribution Beta(1/2, 1/2):^{[5]}^{[15]} the mean of a number of realizations of a random walk is a much more robust estimator than the median (which is an inappropriate sample measure estimate in this case).
Geometric mean
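The logarithm of the geometric mean G_{X} of a distribution with random variable X is the arithmetic mean of ln(X), or, equivalently, its expected value:
$$ \ln G_X = \operatorname{E}[\ln X] $$
For a beta distribution, the expected value integral gives:
$$ \operatorname{E}[\ln X] = \int_0^1 \ln x\, f(x;\alpha,\beta)\,dx = \int_0^1 \ln x\,\frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha,\beta)}\,dx = \psi(\alpha) - \psi(\alpha+\beta) $$
where ψ is the digamma function.
Therefore, the geometric mean of a beta distribution with shape parameters α and β is the exponential of the digamma functions of α and β as follows:
$$ G_X = e^{\operatorname{E}[\ln X]} = e^{\psi(\alpha)-\psi(\alpha+\beta)} $$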
While for a beta distribution with equal shape parameters α = β, it follows that skewness = 0 and mode = mean = median = 1/2, the geometric mean is less than 1/2: 0 < G_{X} < 1/2. The reason for this is that the logarithmic transformation strongly weights the values of X close to zero, as ln(X) strongly tends towards negative infinity as X approaches zero, while ln(X) flattens towards zero as X → 1.
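Along a line α = β, the following limits apply:
$$ \lim_{\alpha=\beta\to 0} G_X = 0, \qquad \lim_{\alpha=\beta\to\infty} G_X = \tfrac{1}{2} $$
Following are the limits with one parameter finite (nonzero) and the other approaching these limits:
$$ \lim_{\beta\to 0} G_X = \lim_{\alpha\to\infty} G_X = 1, \qquad \lim_{\alpha\to 0} G_X = \lim_{\beta\to\infty} G_X = 0 $$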
The accompanying plot shows the difference between the mean and the geometric mean for shape parameters α and β from zero to 2. Besides the fact that the difference between them approaches zero as α and β approach infinity and that the difference becomes large for values of α and β approaching zero, one can observe an evident asymmetry of the geometric mean with respect to the shape parameters α and β. The difference between the geometric mean and the mean is larger for small values of α in relation to β than when exchanging the magnitudes of β and α.
N. L. Johnson and S. Kotz^{[1]} suggest the logarithmic approximation to the digamma function ψ(α) ≈ ln(α − 1/2) which results in the following approximation to the geometric mean:
$$ G_X \approx \frac{\alpha - \tfrac{1}{2}}{\alpha+\beta-\tfrac{1}{2}} \quad\text{if } \alpha,\beta > 1 $$
Numerical values for the relative error in this approximation follow: [(α = β = 1): 9.39%]; [(α = β = 2): 1.39%]; [(α = 2, β = 3): 1.51%]; [(α = 3, β = 2): 0.44%]; [(α = β = 3): 0.51%]; [(α = β = 4): 0.26%]; [(α = 3, β = 4): 0.55%]; [(α = 4, β = 3): 0.24%].
Similarly, one can calculate the value of shape parameters required for the geometric mean to equal 1/2. Given the value of the parameter β, what would be the value of the other parameter, α, required for the geometric mean to equal 1/2? The answer is that (for β > 1), the value of α required tends towards β + 1/2 as β → ∞. For example, all these couples have the same geometric mean of 1/2: [β = 1, α = 1.4427], [β = 2, α = 2.46958], [β = 3, α = 3.47943], [β = 4, α = 4.48449], [β = 5, α = 5.48756], [β = 10, α = 10.4938], [β = 100, α = 100.499].
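The geometric mean exp(ψ(α) − ψ(α + β)) and its logarithmic approximation can be compared numerically. The Python standard library has no digamma function, so the sketch below supplies its own `digamma` helper (a recurrence shift followed by the standard asymptotic series); that helper and all function names are assumptions made for illustration, not a reference implementation.

```python
import math

def digamma(x: float) -> float:
    """Digamma function for x > 0: shift x upward with psi(x) = psi(x + 1) - 1/x,
    then apply the asymptotic series ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)."""
    if x <= 0:
        raise ValueError("this sketch only handles x > 0")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def geometric_mean(alpha: float, beta: float) -> float:
    """G_X = exp(psi(alpha) - psi(alpha + beta)) for Beta(alpha, beta)."""
    return math.exp(digamma(alpha) - digamma(alpha + beta))

def geometric_mean_approx(alpha: float, beta: float) -> float:
    """Approximation obtained from psi(a) ~ ln(a - 1/2)."""
    return (alpha - 0.5) / (alpha + beta - 0.5)
```

For β = 1 the geometric mean reduces to exp(−1/α), so α = 1/ln 2 ≈ 1.4427 gives exactly 1/2, in agreement with the first couple listed above.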
The fundamental property of the geometric mean, which can be proven to be false for any other mean, is
$$ G\!\left(\frac{X_i}{Y_i}\right) = \frac{G(X_i)}{G(Y_i)} $$
This makes the geometric mean the only correct mean when averaging normalized results, that is, results that are presented as ratios to reference values.^{[16]} This is relevant because the beta distribution is a suitable model for the random behavior of percentages and it is particularly suitable to the statistical modelling of proportions. The geometric mean plays a central role in maximum likelihood estimation, see section "Parameter estimation, maximum likelihood." When performing maximum likelihood estimation, besides the geometric mean G_{X} based on the random variable X, another geometric mean also appears naturally: the geometric mean based on the linear transformation (1 − X), the mirror-image of X, denoted by G_{(1−X)}:
$$ G_{(1-X)} = e^{\operatorname{E}[\ln(1-X)]} = e^{\psi(\beta)-\psi(\alpha+\beta)} $$
Along a line α = β, the following limits apply:
$$ \lim_{\alpha=\beta\to 0} G_{(1-X)} = 0, \qquad \lim_{\alpha=\beta\to\infty} G_{(1-X)} = \tfrac{1}{2} $$
Following are the limits with one parameter finite (nonzero) and the other approaching these limits:
$$ \lim_{\beta\to 0} G_{(1-X)} = \lim_{\alpha\to\infty} G_{(1-X)} = 0, \qquad \lim_{\alpha\to 0} G_{(1-X)} = \lim_{\beta\to\infty} G_{(1-X)} = 1 $$
It has the following approximate value:
$$ G_{(1-X)} \approx \frac{\beta - \tfrac{1}{2}}{\alpha+\beta-\tfrac{1}{2}} \quad\text{if } \alpha,\beta > 1 $$
Although both G_{X} and G_{(1−X)} are asymmetric, in the case that both shape parameters are equal α = β, the geometric means are equal: G_{X} = G_{(1−X)}. This equality follows from the following symmetry displayed between both geometric means:
$$ G_X(\mathrm{B}(\alpha,\beta)) = G_{(1-X)}(\mathrm{B}(\beta,\alpha)) $$
Harmonic mean
The inverse of the harmonic mean (H_{X}) of a distribution with random variable X is the arithmetic mean of 1/X, or, equivalently, its expected value. Therefore, the harmonic mean (H_{X}) of a beta distribution with shape parameters α and β is:
$$ H_X = \frac{1}{\operatorname{E}\left[\frac{1}{X}\right]} = \frac{\alpha-1}{\alpha+\beta-1} \quad\text{if } \alpha > 1 \text{ and } \beta > 0 $$
The harmonic mean (H_{X}) of a beta distribution with α < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter α less than unity.
Letting α = β in the above expression one obtains
$$ H_X = \frac{\alpha-1}{2\alpha-1} $$
showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞.
Following are the limits with one parameter finite (nonzero) and the other approaching these limits:
$$ \lim_{\alpha\to 1} H_X = \lim_{\beta\to\infty} H_X = 0, \qquad \lim_{\beta\to 0} H_X = \lim_{\alpha\to\infty} H_X = 1 $$
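The harmonic mean follows from E[1/X] = (α + β − 1)/(α − 1), which is finite only for α > 1. A minimal sketch, with illustrative function names, that refuses the undefined case:

```python
def expected_inverse(alpha: float, beta: float) -> float:
    """E[1/X] for X ~ Beta(alpha, beta); finite only when alpha > 1."""
    if alpha <= 1 or beta <= 0:
        raise ValueError("E[1/X] requires alpha > 1 and beta > 0")
    return (alpha + beta - 1) / (alpha - 1)

def harmonic_mean(alpha: float, beta: float) -> float:
    """H_X = 1 / E[1/X] = (alpha - 1) / (alpha + beta - 1) for alpha > 1."""
    return 1.0 / expected_inverse(alpha, beta)
```

For the symmetric case this gives (α − 1)/(2α − 1), for example 1/3 for Beta(2, 2), which stays below the mean of 1/2.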
The harmonic mean plays a role in maximum likelihood estimation for the four-parameter case, in addition to the geometric mean. When performing maximum likelihood estimation for the four-parameter case, besides the harmonic mean H_{X} based on the random variable X, another harmonic mean also appears naturally: the harmonic mean based on the linear transformation (1 − X), the mirror-image of X, denoted by H_{1 − X}:
$$ H_{1-X} = \frac{1}{\operatorname{E}\left[\frac{1}{1-X}\right]} = \frac{\beta-1}{\alpha+\beta-1} \quad\text{if } \beta > 1 \text{ and } \alpha > 0 $$
The harmonic mean (H_{(1 − X)}) of a beta distribution with β < 1 is undefined, because its defining expression is not bounded in [0, 1] for shape parameter β less than unity.
Letting α = β in the above expression one obtains
$$ H_{1-X} = \frac{\beta-1}{2\beta-1} $$
showing that for α = β the harmonic mean ranges from 0, for α = β = 1, to 1/2, for α = β → ∞.
Following are the limits with one parameter finite (nonzero) and the other approaching these limits:
$$ \lim_{\beta\to 1} H_{1-X} = \lim_{\alpha\to\infty} H_{1-X} = 0, \qquad \lim_{\alpha\to 0} H_{1-X} = \lim_{\beta\to\infty} H_{1-X} = 1 $$
Although both H_{X} and H_{1−X} are asymmetric, in the case that both shape parameters are equal α = β, the harmonic means are equal: H_{X} = H_{1−X}. This equality follows from the following symmetry displayed between both harmonic means:
$$ H_X(\mathrm{B}(\alpha,\beta)) = H_{1-X}(\mathrm{B}(\beta,\alpha)) \quad\text{if } \alpha,\beta > 1 $$
Measures of statistical dispersion
Variance
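The variance (the second moment centered on the mean) of a beta-distributed random variable X with parameters α and β is:^{[1]}^{[17]}
$$ \operatorname{var}(X) = \operatorname{E}[(X-\mu)^2] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} $$
Letting α = β in the above expression one obtains
$$ \operatorname{var}(X) = \frac{1}{4(2\beta+1)} $$
showing that for α = β the variance decreases monotonically as α = β increases. Taking the limit α = β → 0 in this expression gives the maximum variance var(X) = 1/4,^{[1]} which is only approached in that limit and never attained.
The beta distribution may also be parametrized in terms of its mean μ (0 < μ < 1) and sample size ν = α + β (ν > 0) (see section above titled "Mean and sample size"):
$$ \alpha = \mu\nu, \qquad \beta = (1-\mu)\nu, \quad\text{where } \nu = \alpha+\beta > 0 $$
Using this parametrization, one can express the variance in terms of the mean μ and the sample size ν as follows:
$$ \operatorname{var}(X) = \frac{\mu(1-\mu)}{1+\nu} $$
Since ν = (α + β) > 0, it must follow that var(X) < μ(1 − μ).
For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore:
$$ \operatorname{var}(X) = \frac{1}{4(1+\nu)} \quad\text{if } \mu = \tfrac{1}{2} $$
Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
$$ \lim_{\beta\to 0}\operatorname{var}(X) = \lim_{\alpha\to 0}\operatorname{var}(X) = \lim_{\beta\to\infty}\operatorname{var}(X) = \lim_{\alpha\to\infty}\operatorname{var}(X) = \lim_{\nu\to\infty}\operatorname{var}(X) = \lim_{\mu\to 0}\operatorname{var}(X) = \lim_{\mu\to 1}\operatorname{var}(X) = 0 $$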
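The two ways of writing the variance, in terms of (α, β) and in terms of (μ, ν), can be checked against each other with a few lines of code. This is a minimal sketch with illustrative function names.

```python
def variance_from_shapes(alpha: float, beta: float) -> float:
    """var(X) = alpha * beta / ((alpha + beta)^2 (alpha + beta + 1))."""
    s = alpha + beta
    return alpha * beta / (s * s * (s + 1))

def variance_from_mean_size(mu: float, nu: float) -> float:
    """var(X) = mu (1 - mu) / (1 + nu), with nu = alpha + beta."""
    return mu * (1 - mu) / (1 + nu)
```

For Beta(2, 3), with μ = 0.4 and ν = 5, both expressions give 0.04; and since ν > 0, the second form makes the bound var(X) < μ(1 − μ) immediate.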
Geometric variance and covariance
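The logarithm of the geometric variance, ln(var_{GX}), of a distribution with random variable X is the second moment of the logarithm of X centered on the geometric mean of X, ln(G_{X}):
$$ \ln \mathrm{var}_{GX} = \operatorname{E}\left[(\ln X - \ln G_X)^2\right] = \operatorname{E}\left[(\ln X - \operatorname{E}[\ln X])^2\right] = \operatorname{E}\left[(\ln X)^2\right] - (\operatorname{E}[\ln X])^2 = \operatorname{var}[\ln X] $$
and therefore, the geometric variance is:
$$ \mathrm{var}_{GX} = e^{\operatorname{var}[\ln X]} $$
In the Fisher information matrix, and the curvature of the log likelihood function, the logarithm of the geometric variance of the reflected variable 1 − X and the logarithm of the geometric covariance between X and 1 − X appear:
$$ \ln \mathrm{var}_{G(1-X)} = \operatorname{E}\left[(\ln(1-X) - \ln G_{(1-X)})^2\right] = \operatorname{var}[\ln(1-X)], \qquad \mathrm{var}_{G(1-X)} = e^{\operatorname{var}[\ln(1-X)]} $$
$$ \ln \mathrm{cov}_{G\,X,1-X} = \operatorname{E}\left[(\ln X - \ln G_X)(\ln(1-X) - \ln G_{(1-X)})\right] = \operatorname{cov}[\ln X, \ln(1-X)], \qquad \mathrm{cov}_{G\,X,1-X} = e^{\operatorname{cov}[\ln X,\ln(1-X)]} $$
For a beta distribution, higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two gamma distributions and differentiating through the integral. They can be expressed in terms of higher order polygamma functions. See the section titled "Other moments, Moments of transformed random variables, Moments of logarithmically transformed random variables". The variance of the logarithmic variables and covariance of ln X and ln(1−X) are:
$$ \operatorname{var}[\ln X] = \psi_1(\alpha) - \psi_1(\alpha+\beta) $$
$$ \operatorname{var}[\ln(1-X)] = \psi_1(\beta) - \psi_1(\alpha+\beta) $$
$$ \operatorname{cov}[\ln X, \ln(1-X)] = -\psi_1(\alpha+\beta) $$
where the trigamma function, denoted ψ_{1}(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function:
$$ \psi_1(\alpha) = \frac{d^2 \ln\Gamma(\alpha)}{d\alpha^2} = \frac{d\,\psi(\alpha)}{d\alpha} $$
Therefore,
$$ \ln \mathrm{var}_{GX} = \psi_1(\alpha) - \psi_1(\alpha+\beta), \qquad \ln \mathrm{var}_{G(1-X)} = \psi_1(\beta) - \psi_1(\alpha+\beta), \qquad \ln \mathrm{cov}_{G\,X,1-X} = -\psi_1(\alpha+\beta) $$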
The accompanying plots show the log geometric variances and log geometric covariance versus the shape parameters α and β. The plots show that the log geometric variances and log geometric covariance are close to zero for shape parameters α and β greater than 2, and that the log geometric variances rapidly rise in value for shape parameter values α and β less than unity. The log geometric variances are positive for all values of the shape parameters. The log geometric covariance is negative for all values of the shape parameters, and it reaches large negative values for α and β less than unity.
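Following are the limits with one parameter finite (nonzero) and the other approaching these limits:
$$ \lim_{\alpha\to 0} \ln \mathrm{var}_{GX} = \lim_{\beta\to 0} \ln \mathrm{var}_{G(1-X)} = \infty $$
$$ \lim_{\beta\to 0} \ln \mathrm{var}_{GX} = \lim_{\alpha\to\infty} \ln \mathrm{var}_{GX} = \lim_{\alpha\to 0} \ln \mathrm{var}_{G(1-X)} = \lim_{\beta\to\infty} \ln \mathrm{var}_{G(1-X)} = \lim_{\alpha\to\infty} \ln \mathrm{cov}_{G\,X,1-X} = \lim_{\beta\to\infty} \ln \mathrm{cov}_{G\,X,1-X} = 0 $$
$$ \lim_{\beta\to\infty} \ln \mathrm{var}_{GX} = \psi_1(\alpha), \qquad \lim_{\alpha\to\infty} \ln \mathrm{var}_{G(1-X)} = \psi_1(\beta) $$
$$ \lim_{\alpha\to 0} \ln \mathrm{cov}_{G\,X,1-X} = -\psi_1(\beta), \qquad \lim_{\beta\to 0} \ln \mathrm{cov}_{G\,X,1-X} = -\psi_1(\alpha) $$
Limits with two parameters varying:
$$ \lim_{\alpha\to\infty}\left(\lim_{\beta\to\infty} \ln \mathrm{var}_{GX}\right) = \lim_{\beta\to\infty}\left(\lim_{\alpha\to\infty} \ln \mathrm{var}_{G(1-X)}\right) = \lim_{\alpha\to\infty}\left(\lim_{\beta\to 0} \ln \mathrm{cov}_{G\,X,1-X}\right) = \lim_{\beta\to\infty}\left(\lim_{\alpha\to 0} \ln \mathrm{cov}_{G\,X,1-X}\right) = 0 $$
$$ \lim_{\alpha\to 0}\left(\lim_{\beta\to 0} \ln \mathrm{cov}_{G\,X,1-X}\right) = -\infty $$
Although both ln(var_{GX}) and ln(var_{G(1 − X)}) are asymmetric, when the shape parameters are equal, α = β, one has: ln(var_{GX}) = ln(var_{G(1−X)}). This equality follows from the following symmetry displayed between both log geometric variances:
$$ \ln \mathrm{var}_{GX}(\mathrm{B}(\alpha,\beta)) = \ln \mathrm{var}_{G(1-X)}(\mathrm{B}(\beta,\alpha)) $$
The log geometric covariance is symmetric:
$$ \ln \mathrm{cov}_{G\,X,1-X}(\mathrm{B}(\alpha,\beta)) = \ln \mathrm{cov}_{G\,X,1-X}(\mathrm{B}(\beta,\alpha)) $$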
Mean absolute deviation around the mean
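The mean absolute deviation around the mean for the beta distribution with shape parameters α and β is:^{[6]}
$$ \operatorname{E}[|X - \operatorname{E}[X]|] = \frac{2\alpha^{\alpha}\beta^{\beta}}{\mathrm{B}(\alpha,\beta)(\alpha+\beta)^{\alpha+\beta+1}} $$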
The mean absolute deviation around the mean is a more robust estimator of statistical dispersion than the standard deviation for beta distributions with tails and inflection points at each side of the mode, Beta(α, β) distributions with α,β > 2, as it depends on the linear (absolute) deviations rather than the square deviations from the mean. Therefore, the effect of very large deviations from the mean is not as heavily weighted.
Using Stirling's approximation to the Gamma function, N. L. Johnson and S. Kotz^{[1]} derived the following approximation for values of the shape parameters greater than unity (the relative error for this approximation is only −3.5% for α = β = 1, and it decreases to zero as α → ∞, β → ∞):
$$ \frac{\text{mean abs. dev. from mean}}{\text{standard deviation}} = \frac{\operatorname{E}[|X - \operatorname{E}[X]|]}{\sqrt{\operatorname{var}(X)}} \approx \sqrt{\frac{2}{\pi}}\left(1 + \frac{7}{12(\alpha+\beta)} - \frac{1}{12\alpha} - \frac{1}{12\beta}\right), \quad\text{if } \alpha,\beta > 1 $$
At the limit α → ∞, β → ∞, the ratio of the mean absolute deviation to the standard deviation (for the beta distribution) becomes equal to the ratio of the same measures for the normal distribution: $\sqrt{2/\pi}$. For α = β = 1 this ratio equals $\sqrt{3}/2$, so that from α = β = 1 to α, β → ∞ the ratio decreases by 8.5%. For α = β = 0 the standard deviation is exactly equal to the mean absolute deviation around the mean. Therefore, this ratio decreases by 15% from α = β = 0 to α = β = 1, and by 25% from α = β = 0 to α, β → ∞. However, for skewed beta distributions such that α → 0 or β → 0, the ratio of the standard deviation to the mean absolute deviation approaches infinity (although each of them, individually, approaches zero) because the mean absolute deviation approaches zero faster than the standard deviation.
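The ratio of the mean absolute deviation to the standard deviation can be evaluated numerically from the closed form E|X − E[X]| = 2α^α β^β / (B(α, β)(α + β)^(α+β+1)). The sketch below works in logarithms to avoid overflow for large shape parameters; the function names are illustrative.

```python
import math

def mean_abs_deviation(alpha: float, beta: float) -> float:
    """E|X - E[X]| for Beta(alpha, beta), computed in log space."""
    s = alpha + beta
    log_beta_fn = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(s)
    log_value = (math.log(2.0) + alpha * math.log(alpha) + beta * math.log(beta)
                 - log_beta_fn - (s + 1.0) * math.log(s))
    return math.exp(log_value)

def std_deviation(alpha: float, beta: float) -> float:
    """Square root of alpha*beta / ((alpha+beta)^2 (alpha+beta+1))."""
    s = alpha + beta
    return math.sqrt(alpha * beta / (s * s * (s + 1.0)))

def mad_to_sd_ratio(alpha: float, beta: float) -> float:
    """Ratio of mean absolute deviation to standard deviation."""
    return mean_abs_deviation(alpha, beta) / std_deviation(alpha, beta)
```

For the uniform case α = β = 1 the ratio is (1/4)/(1/√12) = √3/2, and for large equal shape parameters it approaches the normal-distribution value √(2/π).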
Using the parametrization in terms of mean μ and sample size ν = α + β > 0:
 α = μν, β = (1−μ)ν
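one can express the mean absolute deviation around the mean in terms of the mean μ and the sample size ν as follows:
$$ \operatorname{E}[|X - \operatorname{E}[X]|] = \frac{2\mu^{\mu\nu}(1-\mu)^{(1-\mu)\nu}}{\nu\,\mathrm{B}(\mu\nu,(1-\mu)\nu)} $$
For a symmetric distribution, the mean is at the middle of the distribution, μ = 1/2, and therefore:
$$ \operatorname{E}[|X - \operatorname{E}[X]|] = \frac{2^{1-\nu}}{\nu\,\mathrm{B}\!\left(\tfrac{\nu}{2},\tfrac{\nu}{2}\right)} = \frac{2^{1-\nu}\,\Gamma(\nu)}{\nu\,\left(\Gamma\!\left(\tfrac{\nu}{2}\right)\right)^2} $$
Also, the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
$$ \lim_{\beta\to 0}\operatorname{E}[|X - \operatorname{E}[X]|] = \lim_{\alpha\to 0}\operatorname{E}[|X - \operatorname{E}[X]|] = \lim_{\beta\to\infty}\operatorname{E}[|X - \operatorname{E}[X]|] = \lim_{\alpha\to\infty}\operatorname{E}[|X - \operatorname{E}[X]|] = 0 $$
$$ \lim_{\mu\to 0}\operatorname{E}[|X - \operatorname{E}[X]|] = \lim_{\mu\to 1}\operatorname{E}[|X - \operatorname{E}[X]|] = \lim_{\nu\to\infty}\operatorname{E}[|X - \operatorname{E}[X]|] = 0, \qquad \lim_{\nu\to 0}\operatorname{E}[|X - \operatorname{E}[X]|] = 2\mu(1-\mu) $$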
Mean absolute difference
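The mean absolute difference for the beta distribution is:
$$ \mathrm{MD} = \int_0^1\!\!\int_0^1 f(x;\alpha,\beta)\,f(y;\alpha,\beta)\,|x-y|\,dx\,dy = \left(\frac{4}{\alpha+\beta}\right)\frac{\mathrm{B}(\alpha+\beta,\alpha+\beta)}{\mathrm{B}(\alpha,\alpha)\,\mathrm{B}(\beta,\beta)} $$
The Gini coefficient for the beta distribution is half of the relative mean absolute difference:
$$ \mathrm{G} = \left(\frac{2}{\alpha}\right)\frac{\mathrm{B}(\alpha+\beta,\alpha+\beta)}{\mathrm{B}(\alpha,\alpha)\,\mathrm{B}(\beta,\beta)} $$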
Skewness
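The skewness (the third moment centered on the mean, normalized by the 3/2 power of the variance) of the beta distribution is^{[1]}
$$ \gamma_1 = \frac{\operatorname{E}[(X-\mu)^3]}{(\operatorname{var}(X))^{3/2}} = \frac{2(\beta-\alpha)\sqrt{\alpha+\beta+1}}{(\alpha+\beta+2)\sqrt{\alpha\beta}} $$
Letting α = β in the above expression one obtains γ_{1} = 0, showing once again that for α = β the distribution is symmetric and hence the skewness is zero. The skew is positive (right-tailed) for α < β and negative (left-tailed) for α > β.
Using the parametrization in terms of mean μ and sample size ν = α + β:
$$ \alpha = \mu\nu, \qquad \beta = (1-\mu)\nu, \quad\text{where } \nu = \alpha+\beta > 0 $$
one can express the skewness in terms of the mean μ and the sample size ν as follows:
$$ \gamma_1 = \frac{2(1-2\mu)\sqrt{1+\nu}}{(2+\nu)\sqrt{\mu(1-\mu)}} $$
The skewness can also be expressed just in terms of the variance var and the mean μ as follows:
$$ \gamma_1 = \frac{2(1-2\mu)\sqrt{\mathrm{var}}}{\mu(1-\mu) + \mathrm{var}}, \quad\text{if } \mathrm{var} < \mu(1-\mu) $$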
The accompanying plot of skewness as a function of variance and mean shows that maximum variance (1/4) is coupled with zero skewness and the symmetry condition (μ = 1/2), and that maximum skewness (positive or negative infinity) occurs when the mean is located at one end or the other, so that the "mass" of the probability distribution is concentrated at the ends (minimum variance).
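The following expression for the square of the skewness, in terms of the sample size ν = α + β and the variance var, is useful for the method of moments estimation of four parameters:
$$ (\text{skewness})^2 = (\gamma_1)^2 = \frac{(\operatorname{E}[(X-\mu)^3])^2}{(\operatorname{var}(X))^3} = \frac{4}{(2+\nu)^2}\left(\frac{1}{\mathrm{var}} - 4(1+\nu)\right) $$
This expression correctly gives a skewness of zero for α = β, since in that case (see section titled "Variance"): $\mathrm{var} = \frac{1}{4(1+\nu)}$.
For the symmetric case (α = β), skewness = 0 over the whole range, and the following limits apply:
$$ \lim_{\alpha=\beta\to 0}\gamma_1 = \lim_{\alpha=\beta\to\infty}\gamma_1 = \lim_{\nu\to 0}\gamma_1 = \lim_{\nu\to\infty}\gamma_1 = \lim_{\mu\to 1/2}\gamma_1 = 0 $$
For the asymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
$$ \lim_{\alpha\to 0}\gamma_1 = \lim_{\mu\to 0}\gamma_1 = \infty, \qquad \lim_{\beta\to 0}\gamma_1 = \lim_{\mu\to 1}\gamma_1 = -\infty $$
$$ \lim_{\alpha\to\infty}\gamma_1 = -\frac{2}{\sqrt{\beta}}, \qquad \lim_{\beta\to 0}\left(\lim_{\alpha\to\infty}\gamma_1\right) = -\infty, \qquad \lim_{\beta\to\infty}\left(\lim_{\alpha\to\infty}\gamma_1\right) = 0 $$
$$ \lim_{\beta\to\infty}\gamma_1 = \frac{2}{\sqrt{\alpha}}, \qquad \lim_{\alpha\to 0}\left(\lim_{\beta\to\infty}\gamma_1\right) = \infty, \qquad \lim_{\alpha\to\infty}\left(\lim_{\beta\to\infty}\gamma_1\right) = 0 $$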
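The skewness can be written in terms of (α, β), of (μ, ν), or of (μ, var), and the three forms should agree. The following sketch, with illustrative function names, evaluates each form so they can be compared.

```python
import math

def skewness_from_shapes(alpha: float, beta: float) -> float:
    """gamma_1 = 2 (beta - alpha) sqrt(alpha + beta + 1) / ((alpha + beta + 2) sqrt(alpha beta))."""
    s = alpha + beta
    return 2.0 * (beta - alpha) * math.sqrt(s + 1.0) / ((s + 2.0) * math.sqrt(alpha * beta))

def skewness_from_mean_size(mu: float, nu: float) -> float:
    """Same quantity in terms of the mean mu and sample size nu = alpha + beta."""
    return 2.0 * (1.0 - 2.0 * mu) * math.sqrt(1.0 + nu) / ((2.0 + nu) * math.sqrt(mu * (1.0 - mu)))

def skewness_from_mean_variance(mu: float, var: float) -> float:
    """Same quantity in terms of the mean and the variance; requires var < mu (1 - mu)."""
    if not 0.0 < var < mu * (1.0 - mu):
        raise ValueError("require 0 < var < mu * (1 - mu)")
    return 2.0 * (1.0 - 2.0 * mu) * math.sqrt(var) / (mu * (1.0 - mu) + var)
```

For Beta(2, 3), where μ = 0.4, ν = 5 and var = 0.04, all three forms give 2/7, a positive (right-tailed) skew as expected for α < β.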
Kurtosis
The beta distribution has been applied in acoustic analysis to assess damage to gears, as the kurtosis of the beta distribution has been reported to be a good indicator of the condition of a gear.^{[18]} Kurtosis has also been used to distinguish the seismic signal generated by a person's footsteps from other signals. As persons or other targets moving on the ground generate continuous signals in the form of seismic waves, one can separate different targets based on the seismic waves they generate. Kurtosis is sensitive to impulsive signals, so it is much more sensitive to the signal generated by human footsteps than to other signals generated by vehicles, winds, noise, etc.^{[19]}
Unfortunately, the notation for kurtosis has not been standardized. Kenney and Keeping^{[20]} use the symbol γ_{2} for the excess kurtosis, but Abramowitz and Stegun^{[21]} use different terminology. To prevent confusion^{[22]} between kurtosis (the fourth moment centered on the mean, normalized by the square of the variance) and excess kurtosis, when using symbols, they will be spelled out as follows:^{[6]}^{[7]}
$$ \text{excess kurtosis} = \text{kurtosis} - 3 = \frac{\operatorname{E}[(X-\mu)^4]}{(\operatorname{var}(X))^2} - 3 = \frac{6[(\alpha-\beta)^2(\alpha+\beta+1) - \alpha\beta(\alpha+\beta+2)]}{\alpha\beta(\alpha+\beta+2)(\alpha+\beta+3)} $$
Letting α = β in the above expression one obtains
$$ \text{excess kurtosis} = -\frac{6}{3+2\alpha} \quad\text{if } \alpha = \beta $$
Therefore, for symmetric beta distributions, the excess kurtosis is negative, increasing from a minimum value of −2 at the limit as {α = β} → 0, and approaching a maximum value of zero as {α = β} → ∞. The value of −2 is the minimum value of excess kurtosis that any distribution (not just beta distributions, but any distribution of any possible kind) can ever achieve. This minimum value is reached when all the probability density is entirely concentrated at each end x = 0 and x = 1, with nothing in between: a two-point Bernoulli distribution with equal probability 1/2 at each end (a coin toss: see section below "Kurtosis bounded by the square of the skewness" for further discussion). The description of kurtosis as a measure of the "potential outliers" (or "potential rare, extreme values") of the probability distribution is correct for all distributions including the beta distribution. The more readily rare, extreme values can occur in a beta distribution, the higher its kurtosis; otherwise, the kurtosis is lower. For α ≠ β, skewed beta distributions, the excess kurtosis can reach unlimited positive values (particularly for α → 0 for finite β, or for β → 0 for finite α) because the side away from the mode will produce occasional extreme values. Minimum kurtosis takes place when the mass density is concentrated equally at each end (and therefore the mean is at the center), and there is no probability mass density in between the ends.
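Using the parametrization in terms of mean μ and sample size ν = α + β:
$$ \alpha = \mu\nu, \qquad \beta = (1-\mu)\nu, \quad\text{where } \nu = \alpha+\beta > 0 $$
one can express the excess kurtosis in terms of the mean μ and the sample size ν as follows:
$$ \text{excess kurtosis} = \frac{6}{3+\nu}\left(\frac{(1-2\mu)^2(1+\nu)}{\mu(1-\mu)(2+\nu)} - 1\right) $$
The excess kurtosis can also be expressed in terms of just the following two parameters: the variance var, and the sample size ν as follows:
$$ \text{excess kurtosis} = \frac{6}{(3+\nu)(2+\nu)}\left(\frac{1}{\mathrm{var}} - 6 - 5\nu\right), \quad\text{if } \mathrm{var} < \mu(1-\mu) $$
and, in terms of the variance var and the mean μ as follows:
$$ \text{excess kurtosis} = \frac{6\,\mathrm{var}\,(1 - \mathrm{var} - 5\mu(1-\mu))}{(\mu(1-\mu) + \mathrm{var})(\mu(1-\mu) + 2\,\mathrm{var})}, \quad\text{if } \mathrm{var} < \mu(1-\mu) $$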
The plot of excess kurtosis as a function of the variance and the mean shows that the minimum value of the excess kurtosis (−2, which is the minimum possible value for excess kurtosis for any distribution) is intimately coupled with the maximum value of variance (1/4) and the symmetry condition: the mean occurring at the midpoint (μ = 1/2). This occurs for the symmetric case of α = β = 0, with zero skewness. At the limit, this is the two-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. (A coin toss: one face of the coin being x = 0 and the other face being x = 1.) Variance is maximum because the distribution is bimodal with nothing in between the two modes (spikes) at each end. Excess kurtosis is minimum: the probability density "mass" is zero at the mean and it is concentrated at the two peaks at each end. Excess kurtosis reaches the minimum possible value (for any distribution) when the probability density function has two spikes, one at each end: it is "bi-peaky" with nothing in between them.
On the other hand, the plot shows that for extreme skewed cases, where the mean is located near one or the other end (μ = 0 or μ = 1), the variance is close to zero, and the excess kurtosis rapidly approaches infinity when the mean of the distribution approaches either end.
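Alternatively, the excess kurtosis can also be expressed in terms of just the following two parameters: the square of the skewness, and the sample size ν as follows:
$$ \text{excess kurtosis} = \frac{6}{3+\nu}\left(\frac{(2+\nu)}{4}(\text{skewness})^2 - 1\right), \quad\text{if } (\text{skewness})^2 - 2 < \text{excess kurtosis} < \tfrac{3}{2}(\text{skewness})^2 $$
From this last expression, one can obtain the same limits published practically a century ago by Karl Pearson in his paper,^{[23]} for the beta distribution (see section below titled "Kurtosis bounded by the square of the skewness"). Setting α + β = ν = 0 in the above expression, one obtains Pearson's lower boundary (values for the skewness and excess kurtosis below the boundary (excess kurtosis + 2 − skewness^{2} = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region"). The limit of α + β = ν → ∞ determines Pearson's upper boundary.
$$ \lim_{\nu\to 0}\text{excess kurtosis} = (\text{skewness})^2 - 2, \qquad \lim_{\nu\to\infty}\text{excess kurtosis} = \tfrac{3}{2}(\text{skewness})^2 $$
therefore:
$$ (\text{skewness})^2 - 2 < \text{excess kurtosis} < \tfrac{3}{2}(\text{skewness})^2 $$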
Values of ν = α + β such that ν ranges from zero to infinity, 0 < ν < ∞, span the whole region of the beta distribution in the plane of excess kurtosis versus squared skewness.
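For the symmetric case (α = β), the following limits apply:
$$ \lim_{\alpha=\beta\to 0}\text{excess kurtosis} = -2, \qquad \lim_{\alpha=\beta\to\infty}\text{excess kurtosis} = 0, \qquad \lim_{\mu\to 1/2}\text{excess kurtosis} = -\frac{6}{3+\nu} $$
For the asymmetric cases (α ≠ β) the following limits (with only the noted variable approaching the limit) can be obtained from the above expressions:
$$ \lim_{\alpha\to 0}\text{excess kurtosis} = \lim_{\beta\to 0}\text{excess kurtosis} = \lim_{\mu\to 0}\text{excess kurtosis} = \lim_{\mu\to 1}\text{excess kurtosis} = \infty $$
$$ \lim_{\alpha\to\infty}\text{excess kurtosis} = \frac{6}{\beta}, \qquad \lim_{\beta\to\infty}\text{excess kurtosis} = \frac{6}{\alpha} $$
$$ \lim_{\nu\to 0}\text{excess kurtosis} = \frac{1}{\mu(1-\mu)} - 6, \qquad \lim_{\nu\to\infty}\text{excess kurtosis} = 0 $$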
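The relation between excess kurtosis, squared skewness and ν = α + β, together with Pearson's bounds, can be checked numerically. The sketch below uses the closed forms for the excess kurtosis and the squared skewness in terms of α and β; the function names are illustrative.

```python
def excess_kurtosis(alpha: float, beta: float) -> float:
    """6[(a-b)^2 (a+b+1) - ab(a+b+2)] / (ab (a+b+2)(a+b+3))."""
    s = alpha + beta
    numerator = 6.0 * ((alpha - beta) ** 2 * (s + 1.0) - alpha * beta * (s + 2.0))
    return numerator / (alpha * beta * (s + 2.0) * (s + 3.0))

def skewness_squared(alpha: float, beta: float) -> float:
    """Square of the skewness: 4 (b-a)^2 (a+b+1) / ((a+b+2)^2 ab)."""
    s = alpha + beta
    return 4.0 * (beta - alpha) ** 2 * (s + 1.0) / ((s + 2.0) ** 2 * alpha * beta)

def excess_kurtosis_from_skewness(skew_sq: float, nu: float) -> float:
    """Excess kurtosis from squared skewness and nu = alpha + beta."""
    return 6.0 / (3.0 + nu) * ((2.0 + nu) / 4.0 * skew_sq - 1.0)

def within_pearson_bounds(alpha: float, beta: float) -> bool:
    """Check skewness^2 - 2 < excess kurtosis < 1.5 * skewness^2."""
    s2 = skewness_squared(alpha, beta)
    return s2 - 2.0 < excess_kurtosis(alpha, beta) < 1.5 * s2
```

For the uniform case Beta(1, 1) the excess kurtosis is −6/5, and for Beta(2, 3) both routes give −9/14, which lies strictly between the two boundaries.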
Characteristic function
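The characteristic function is the Fourier transform of the probability density function. The characteristic function of the beta distribution is Kummer's confluent hypergeometric function (of the first kind):^{[1]}^{[21]}^{[24]}
$$\varphi_X(\alpha;\beta;t) = \operatorname{E}\left[e^{itX}\right] = {}_1F_1(\alpha;\alpha+\beta;it) = \sum_{n=0}^{\infty}\frac{\alpha^{(n)}(it)^n}{(\alpha+\beta)^{(n)}\,n!}$$
where
$$x^{(n)} = x(x+1)(x+2)\cdots(x+n-1)$$
is the rising factorial, also called the "Pochhammer symbol". The value of the characteristic function for t = 0 is one:
$$\varphi_X(\alpha;\beta;0) = {}_1F_1(\alpha;\alpha+\beta;0) = 1$$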
Also, the real and imaginary parts of the characteristic function enjoy the following symmetries with respect to the origin of variable t:
The symmetric case α = β simplifies the characteristic function of the beta distribution to a Bessel function, since in the special case α + β = 2α the confluent hypergeometric function (of the first kind) reduces to a Bessel function (the modified Bessel function of the first kind, I_{α−1/2}) using Kummer's second transformation as follows:
In the accompanying plots, the real part (Re) of the characteristic function of the beta distribution is displayed for symmetric (α = β) and skewed (α ≠ β) cases.
Other moments
Moment generating function
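It also follows^{[1]}^{[6]} that the moment generating function is
$$M_X(\alpha;\beta;t) = \operatorname{E}\left[e^{tX}\right] = {}_1F_1(\alpha;\alpha+\beta;t) = 1 + \sum_{k=1}^{\infty}\left(\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}\right)\frac{t^k}{k!}$$
In particular M_{X}(α; β; 0) = 1.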
Higher moments
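Using the moment generating function, the kth raw moment is given by^{[1]} the factor
$$\prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}$$
multiplying the (exponential series) term t^{k}/k! in the series of the moment generating function
$$\operatorname{E}[X^k] = \frac{\alpha^{(k)}}{(\alpha+\beta)^{(k)}} = \prod_{r=0}^{k-1}\frac{\alpha+r}{\alpha+\beta+r}$$
where (x)^{(k)} is a Pochhammer symbol representing rising factorial. It can also be written in a recursive form as
$$\operatorname{E}[X^k] = \frac{\alpha+k-1}{\alpha+\beta+k-1}\operatorname{E}[X^{k-1}]$$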
Since the moment generating function has a positive radius of convergence, the beta distribution is determined by its moments.^{[25]}
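As an illustration of the product form of the raw moments, the following sketch accumulates the factors (α + r)/(α + β + r) one at a time, which is the same as applying the recursive form k times. The function name `beta_raw_moment` is hypothetical.

```python
def beta_raw_moment(k, a, b):
    """k-th raw moment E[X^k] of Beta(a, b) as a ratio of rising factorials."""
    moment = 1.0
    for r in range(k):
        # Each step applies E[X^(r+1)] = (a + r) / (a + b + r) * E[X^r].
        moment *= (a + r) / (a + b + r)
    return moment

a, b = 2.0, 5.0
mean = beta_raw_moment(1, a, b)
variance = beta_raw_moment(2, a, b) - mean ** 2
print(mean, variance)
```

The first raw moment reproduces the mean α/(α + β), and the second raw moment minus the squared mean reproduces the variance αβ/((α + β)^{2}(α + β + 1)).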
Moments of transformed random variables
Moments of linearly transformed, product and inverted random variables
One can also show the following expectations for a transformed random variable,^{[1]} where the random variable X is Beta-distributed with parameters α and β: X ~ Beta(α, β). The expected value of the variable 1 − X is the mirror-symmetry of the expected value based on X:
$$\operatorname{E}[1-X] = \frac{\beta}{\alpha+\beta}$$
Due to the mirror-symmetry of the probability density function of the beta distribution, the variances based on variables X and 1 − X are identical, and the covariance of X and (1 − X) is the negative of the variance:
$$\operatorname{var}[(1-X)] = \operatorname{var}[X] = -\operatorname{cov}[X,(1-X)] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
These are the expected values for inverted variables, (these are related to the harmonic means, see section titled "Harmonic mean"):
The following transformation, dividing the variable X by its mirror-image, X/(1 − X), results in the expected value of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):^{[1]}
Variances of these transformed variables can be obtained by integration, as the expected values of the second moments centered on the corresponding variables:
The following variance of the variable X divided by its mirror-image, X/(1 − X), results in the variance of the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI):^{[1]}
The covariances are:
These expectations and variances appear in the four-parameter Fisher information matrix (section titled "Fisher information," "four parameters").
Moments of logarithmically transformed random variables
Expected values for logarithmic transformations (useful for maximum likelihood estimates, see section titled "Parameter estimation, Maximum likelihood" below) are discussed in this section. The following logarithmic linear transformations are related to the geometric means G_{X} and G_{(1−X)} (see section titled "Geometric mean"):
Where the digamma function ψ(α) is defined as the logarithmic derivative of the gamma function:^{[21]}
Logit transformations are interesting,^{[26]} as they usually transform various shapes (including J-shapes) into (usually skewed) bell-shaped densities over the logit variable, and they may remove the end singularities over the original variable:
$$\operatorname{E}\left[\ln\frac{X}{1-X}\right] = \psi(\alpha) - \psi(\beta)$$
Johnson^{[27]} considered the distribution of the logit-transformed variable ln(X/(1−X)), including its moment generating function and approximations for large values of the shape parameters. This transformation extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞).
Higher order logarithmic moments can be derived by using the representation of a beta distribution as a proportion of two Gamma distributions and differentiating through the integral. They can be expressed in terms of higher order polygamma functions as follows:
therefore the variance of the logarithmic variables and covariance of ln(X) and ln(1−X) are:
where the trigamma function, denoted ψ_{1}(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function:
$$\psi_1(\alpha) = \frac{d^2\ln\Gamma(\alpha)}{d\alpha^2} = \frac{d\psi(\alpha)}{d\alpha}$$
The variances and covariance of the logarithmically transformed variables X and (1−X) are different, in general, because the logarithmic transformation destroys the mirror-symmetry of the original variables X and (1−X), as the logarithm approaches negative infinity for the variable approaching zero.
These logarithmic variances and covariance are the elements of the Fisher information matrix for the beta distribution. They are also a measure of the curvature of the log likelihood function (see section on Maximum likelihood estimation).
The variances of the log inverse variables are identical to the variances of the log variables:
It also follows that the variances of the logit transformed variables are:
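The logarithmic moments can be evaluated with only the standard library. The sketch below assumes the usual identities E[ln X] = ψ(α) − ψ(α + β), var[ln X] = ψ_{1}(α) − ψ_{1}(α + β), cov[ln X, ln(1−X)] = −ψ_{1}(α + β) and var[ln(X/(1−X))] = ψ_{1}(α) + ψ_{1}(β). The `digamma` and `trigamma` helpers are illustrative approximations (upward recurrence followed by an asymptotic series), not library functions.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x up to >= 6, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def trigamma(x):
    """Approximate psi_1(x) for x > 0 with the same shift-then-series strategy."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)    # psi_1(x) = psi_1(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    series = inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 * (1.0 / 42 - inv2 / 30)))
    return result + inv + 0.5 * inv2 + series

def log_moments(a, b):
    """Moments of ln X, ln(1 - X) and the logit of X for X ~ Beta(a, b)."""
    nu = a + b
    return {
        "mean_ln_x": digamma(a) - digamma(nu),
        "mean_ln_1mx": digamma(b) - digamma(nu),
        "var_ln_x": trigamma(a) - trigamma(nu),
        "var_ln_1mx": trigamma(b) - trigamma(nu),
        "cov_ln_x_ln_1mx": -trigamma(nu),
        "var_logit": trigamma(a) + trigamma(b),
    }

uniform = log_moments(1.0, 1.0)
print(uniform)
```

For the uniform case Beta(1, 1), ln X has mean −1 and variance 1, and the logit of X follows a standard logistic distribution with variance π^{2}/3, which gives a convenient sanity check.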
Quantities of information (entropy)
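Given a beta distributed random variable, X ~ Beta(α, β), the differential entropy of X is^{[28]} (measured in nats) the expected value of the negative of the logarithm of the probability density function:
$$h(X) = \operatorname{E}[-\ln f(X;\alpha,\beta)] = \ln B(\alpha,\beta) - (\alpha-1)\psi(\alpha) - (\beta-1)\psi(\beta) + (\alpha+\beta-2)\psi(\alpha+\beta)$$
where f(x; α, β) is the probability density function of the beta distribution:
$$f(x;\alpha,\beta) = \frac{1}{B(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}$$
The digamma function ψ appears in the formula for the differential entropy as a consequence of Euler's integral formula for the harmonic numbers which follows from the integral:
$$\int_0^1\frac{1-x^{\alpha-1}}{1-x}\,dx = \psi(\alpha) - \psi(1)$$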
The differential entropy of the beta distribution is negative for all values of α and β greater than zero, except at α = β = 1 (for which values the beta distribution is the same as the uniform distribution), where the differential entropy reaches its maximum value of zero. It is to be expected that the maximum entropy should take place when the beta distribution becomes equal to the uniform distribution, since uncertainty is maximal when all possible events are equiprobable.
For α or β approaching zero, the differential entropy approaches its minimum value of negative infinity. For (either or both) α or β approaching zero, there is a maximum amount of order: all the probability density is concentrated at the ends, and there is zero probability density at points located between the ends. Similarly for (either or both) α or β approaching infinity, the differential entropy approaches its minimum value of negative infinity, and a maximum amount of order. If either α or β approaches infinity (and the other is finite) all the probability density is concentrated at an end, and the probability density is zero everywhere else. If both shape parameters are equal (the symmetric case), α = β, and they approach infinity simultaneously, the probability density becomes a spike (Dirac delta function) concentrated at the middle x = 1/2, and hence there is 100% probability at the middle x = 1/2 and zero probability everywhere else.
The (continuous case) differential entropy was introduced by Shannon in his original paper (where he named it the "entropy of a continuous distribution"), as the concluding part^{[29]} of the same paper where he defined the discrete entropy. It is known since then that the differential entropy may differ from the infinitesimal limit of the discrete entropy by an infinite offset, therefore the differential entropy can be negative (as it is for the beta distribution). What really matters is the relative value of entropy.
Given two beta distributed random variables, X_{1} ~ Beta(α, β) and X_{2} ~ Beta(α′, β′), the cross entropy is (measured in nats)^{[30]}
$$H(X_1,X_2) = \ln B(\alpha',\beta') - (\alpha'-1)\psi(\alpha) - (\beta'-1)\psi(\beta) + (\alpha'+\beta'-2)\psi(\alpha+\beta)$$
The cross entropy has been used as an error metric to measure the distance between two hypotheses.^{[31]}^{[32]} Its absolute value is minimum when the two distributions are identical. It is the information measure most closely related to the log maximum likelihood^{[30]} (see section on "Parameter estimation. Maximum likelihood estimation").
The relative entropy, or Kullback–Leibler divergence D_{KL}(X_{1} || X_{2}), is a measure of the inefficiency of assuming that the distribution is X_{2} ~ Beta(α′, β′) when the distribution is really X_{1} ~ Beta(α, β). It is defined as follows (measured in nats):
$$D_{\mathrm{KL}}(X_1\,\|\,X_2) = \ln\frac{B(\alpha',\beta')}{B(\alpha,\beta)} + (\alpha-\alpha')\psi(\alpha) + (\beta-\beta')\psi(\beta) + (\alpha'-\alpha+\beta'-\beta)\psi(\alpha+\beta)$$
The relative entropy, or Kullback–Leibler divergence, is always non-negative. A few numerical examples follow:
 X_{1} ~ Beta(1, 1) and X_{2} ~ Beta(3, 3); D_{KL}(X_{1} || X_{2}) = 0.598803; D_{KL}(X_{2} || X_{1}) = 0.267864; h(X_{1}) = 0; h(X_{2}) = −0.267864
 X_{1} ~ Beta(3, 0.5) and X_{2} ~ Beta(0.5, 3); D_{KL}(X_{1} || X_{2}) = 7.21574; D_{KL}(X_{2} || X_{1}) = 7.21574; h(X_{1}) = −1.10805; h(X_{2}) = −1.10805.
The Kullback–Leibler divergence is not symmetric, D_{KL}(X_{1} || X_{2}) ≠ D_{KL}(X_{2} || X_{1}), for the case in which the individual beta distributions Beta(1, 1) and Beta(3, 3) are symmetric, but have different entropies h(X_{1}) ≠ h(X_{2}). The value of the Kullback divergence depends on the direction traveled: whether going from a higher (differential) entropy to a lower (differential) entropy or the other way around. In the numerical example above, the Kullback divergence measures the inefficiency of assuming that the distribution is (bell-shaped) Beta(3, 3), rather than (uniform) Beta(1, 1). The "h" entropy of Beta(1, 1) is higher than the "h" entropy of Beta(3, 3) because the uniform distribution Beta(1, 1) has a maximum amount of disorder. The Kullback divergence is more than two times higher (0.598803 instead of 0.267864) when measured in the direction of decreasing entropy: the direction that assumes that the (uniform) Beta(1, 1) distribution is (bell-shaped) Beta(3, 3) rather than the other way around. In this restricted sense, the Kullback divergence is consistent with the second law of thermodynamics.
The Kullback–Leibler divergence is symmetric, D_{KL}(X_{1} || X_{2}) = D_{KL}(X_{2} || X_{1}), for the skewed cases Beta(3, 0.5) and Beta(0.5, 3) that have equal differential entropy h(X_{1}) = h(X_{2}).
The symmetry condition:
$$D_{\mathrm{KL}}(X_1\,\|\,X_2) = D_{\mathrm{KL}}(X_2\,\|\,X_1), \text{ if } h(X_1) = h(X_2), \text{ for (skewed) } \alpha \neq \beta$$
follows from the above definitions and the mirror-symmetry f(x; α, β) = f(1−x; β, α) enjoyed by the beta distribution.
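The numerical examples quoted in this section can be reproduced with a short standard-library sketch. It assumes the closed forms h(X) = ln B(α, β) − (α−1)ψ(α) − (β−1)ψ(β) + (α+β−2)ψ(α+β) for the differential entropy and the corresponding digamma expression for the Kullback–Leibler divergence. The helper names are illustrative, and `digamma` is a simple series approximation rather than a library routine.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0 (upward recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def log_beta(a, b):
    """Natural logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_entropy(a, b):
    """Differential entropy (in nats) of Beta(a, b)."""
    return (log_beta(a, b) - (a - 1.0) * digamma(a) - (b - 1.0) * digamma(b)
            + (a + b - 2.0) * digamma(a + b))

def beta_kl(a, b, a2, b2):
    """Kullback-Leibler divergence (in nats) from Beta(a, b) to Beta(a2, b2)."""
    return (log_beta(a2, b2) - log_beta(a, b)
            + (a - a2) * digamma(a) + (b - b2) * digamma(b)
            + (a2 - a + b2 - b) * digamma(a + b))

print(beta_kl(1, 1, 3, 3), beta_kl(3, 3, 1, 1))        # asymmetric pair
print(beta_kl(3, 0.5, 0.5, 3), beta_kl(0.5, 3, 3, 0.5))  # mirror-image pair
print(beta_entropy(1, 1), beta_entropy(3, 3), beta_entropy(3, 0.5))
```

The two directions differ for Beta(1, 1) versus Beta(3, 3) and coincide for the mirror-image pair Beta(3, 0.5) and Beta(0.5, 3), matching the values listed above to the quoted precision.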
Relationships between statistical measures
Mean, mode and median relationship
If 1 < α < β then mode ≤ median ≤ mean.^{[13]} Expressing the mode (only for α, β > 1), and the mean in terms of α and β:
$$\frac{\alpha-1}{\alpha+\beta-2} \le \text{median} \le \frac{\alpha}{\alpha+\beta}$$
If 1 < β < α then the order of the inequalities is reversed. For α, β > 1 the absolute distance between the mean and the median is less than 5% of the distance between the maximum and minimum values of x. On the other hand, the absolute distance between the mean and the mode can reach 50% of the distance between the maximum and minimum values of x, in the (pathological) limit α → 1 and β → 1 (for which values the beta distribution approaches the uniform distribution and the differential entropy approaches its maximum value, and hence maximum "disorder").
For example, for α = 1.0001 and β = 1.00000001:
 mode = 0.9999; PDF(mode) = 1.00010
 mean = 0.500025; PDF(mean) = 1.00003
 median = 0.500035; PDF(median) = 1.00003
 mean − mode = −0.499875
 mean − median = −9.65538 × 10^{−6}
(where PDF stands for the value of the probability density function)
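The mode and mean in this example follow directly from their closed forms. A small sketch (function names are illustrative; the median is omitted because it requires inverting the regularized incomplete beta function, which the standard library does not provide):

```python
def beta_mean(a, b):
    """Mean of Beta(a, b)."""
    return a / (a + b)

def beta_mode(a, b):
    """Mode of Beta(a, b); the closed form is valid only for a > 1 and b > 1."""
    if not (a > 1.0 and b > 1.0):
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1.0) / (a + b - 2.0)

a, b = 1.0001, 1.00000001
mode = beta_mode(a, b)
mean = beta_mean(a, b)
print(mode, mean, mean - mode)
```

Although both parameters are barely above 1, the mode sits almost at the right end while the mean stays near 1/2, which is the near-50% gap described above.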
Mean, geometric mean and harmonic mean relationship
It is known from the inequality of arithmetic and geometric means that the geometric mean is lower than the mean. Similarly, the harmonic mean is lower than the geometric mean. The accompanying plot shows that for α = β, both the mean and the median are exactly equal to 1/2, regardless of the value of α = β, and the mode is also equal to 1/2 for α = β > 1, however the geometric and harmonic means are lower than 1/2 and they only approach this value asymptotically as α = β → ∞.
Kurtosis bounded by the square of the skewness
As remarked by Feller,^{[5]} in the Pearson system the beta probability density appears as type I (any difference between the beta distribution and Pearson's type I distribution is only superficial and it makes no difference for the following discussion regarding the relationship between kurtosis and skewness). Karl Pearson showed, in Plate 1 of his paper ^{[23]} published in 1916, a graph with the kurtosis as the vertical axis (ordinate) and the square of the skewness as the horizontal axis (abscissa), in which a number of distributions were displayed.^{[33]} The region occupied by the beta distribution is bounded by the following two lines in the (skewness^{2},kurtosis) plane, or the (skewness^{2},excess kurtosis) plane:
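$$(\text{skewness})^2 + 1 < \text{kurtosis} < \tfrac{3}{2}(\text{skewness})^2 + 3$$
or, equivalently,
$$(\text{skewness})^2 - 2 < \text{excess kurtosis} < \tfrac{3}{2}(\text{skewness})^2$$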
At a time when there were no powerful digital computers, Karl Pearson accurately computed further boundaries,^{[4]}^{[23]} for example, separating the "U-shaped" from the "J-shaped" distributions. The lower boundary line (excess kurtosis + 2 − skewness^{2} = 0) is produced by skewed "U-shaped" beta distributions with both values of shape parameters α and β close to zero. The upper boundary line (excess kurtosis − (3/2) skewness^{2} = 0) is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. Karl Pearson showed^{[23]} that this upper boundary line (excess kurtosis − (3/2) skewness^{2} = 0) is also the intersection with Pearson's distribution III, which has unlimited support in one direction (towards positive infinity), and can be bell-shaped or J-shaped. His son, Egon Pearson, showed^{[33]} that the region (in the kurtosis/squared-skewness plane) occupied by the beta distribution (equivalently, Pearson's distribution I) as it approaches this boundary (excess kurtosis − (3/2) skewness^{2} = 0) is shared with the noncentral chi-squared distribution. Karl Pearson^{[34]} (Pearson 1895, pp. 357, 360, 373–376) also showed that the gamma distribution is a Pearson type III distribution. Hence this boundary line for Pearson's type III distribution is known as the gamma line. (This can be shown from the fact that the excess kurtosis of the gamma distribution is 6/k and the square of the skewness is 4/k, hence (excess kurtosis − (3/2) skewness^{2} = 0) is identically satisfied by the gamma distribution regardless of the value of the parameter "k").
Pearson later noted that the chi-squared distribution is a special case of Pearson's type III and also shares this boundary line (as is apparent from the fact that for the chi-squared distribution the excess kurtosis is 12/k and the square of the skewness is 8/k, hence (excess kurtosis − (3/2) skewness^{2} = 0) is identically satisfied regardless of the value of the parameter "k"). This is to be expected, since the chi-squared distribution X ~ χ^{2}(k) is a special case of the gamma distribution, with parametrization X ~ Γ(k/2, 1/2) where k is a positive integer that specifies the "number of degrees of freedom" of the chi-squared distribution.
An example of a beta distribution near the upper boundary (excess kurtosis − (3/2) skewness^{2} = 0) is given by α = 0.1, β = 1000, for which the ratio (excess kurtosis)/(skewness^{2}) = 1.49835 approaches the upper limit of 1.5 from below. An example of a beta distribution near the lower boundary (excess kurtosis + 2 − skewness^{2} = 0) is given by α= 0.0001, β = 0.1, for which values the expression (excess kurtosis + 2)/(skewness^{2}) = 1.01621 approaches the lower limit of 1 from above. In the infinitesimal limit for both α and β approaching zero symmetrically, the excess kurtosis reaches its minimum value at −2. This minimum value occurs at the point at which the lower boundary line intersects the vertical axis (ordinate). (However, in Pearson's original chart, the ordinate is kurtosis, instead of excess kurtosis, and it increases downwards rather than upwards).
Values for the skewness and excess kurtosis below the lower boundary (excess kurtosis + 2 − skewness^{2} = 0) cannot occur for any distribution, and hence Karl Pearson appropriately called the region below this boundary the "impossible region." The boundary for this "impossible region" is determined by (symmetric or skewed) bimodal U-shaped distributions for which parameters α and β approach zero and hence all the probability density is concentrated at the ends: x = 0, 1 with practically nothing in between them. Since for α ≈ β ≈ 0 the probability density is concentrated at the two ends x = 0 and x = 1, this "impossible boundary" is determined by a two-point distribution: the random variable can only take 2 values (Bernoulli distribution), one value with probability p and the other with probability q = 1−p. For cases approaching this limit boundary with symmetry α = β, skewness ≈ 0, excess kurtosis ≈ −2 (this is the lowest excess kurtosis possible for any distribution), and the probabilities are p ≈ q ≈ 1/2. For cases approaching this limit boundary with skewness, excess kurtosis ≈ −2 + skewness^{2}, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities p = β/(α + β) at the left end x = 0 and q = 1 − p = α/(α + β) at the right end x = 1.
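The two numerical examples quoted for the boundary lines can be checked with a brief sketch, assuming the standard closed forms for the squared skewness and excess kurtosis of Beta(α, β). The function names are illustrative.

```python
def beta_squared_skewness(a, b):
    """Squared skewness of Beta(a, b)."""
    nu = a + b
    return 4.0 * (b - a) ** 2 * (nu + 1.0) / (a * b * (nu + 2.0) ** 2)

def beta_excess_kurtosis(a, b):
    """Excess kurtosis of Beta(a, b)."""
    nu = a + b
    numerator = 6.0 * ((a - b) ** 2 * (nu + 1.0) - a * b * (nu + 2.0))
    return numerator / (a * b * (nu + 2.0) * (nu + 3.0))

# Near the upper boundary (the "gamma line"): ratio tends to 3/2 from below.
upper_ratio = beta_excess_kurtosis(0.1, 1000.0) / beta_squared_skewness(0.1, 1000.0)

# Near the lower ("impossible region") boundary: ratio tends to 1 from above.
lower_ratio = (beta_excess_kurtosis(0.0001, 0.1) + 2.0) / beta_squared_skewness(0.0001, 0.1)

print(upper_ratio, lower_ratio)
```

The first ratio stays just under 1.5 and the second just above 1, consistent with the beta distribution occupying the open region between the two lines.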
Symmetry
All statements are conditional on α, β > 0
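 Probability density function reflection symmetry: f(x; α, β) = f(1 − x; β, α)
 Cumulative distribution function reflection symmetry plus unitary translation: F(x; α, β) = I_{x}(α, β) = 1 − F(1 − x; β, α) = 1 − I_{1−x}(β, α)
 Mode reflection symmetry plus unitary translation: mode(Beta(α, β)) = 1 − mode(Beta(β, α)), if Beta(β, α) ≠ Beta(1, 1)
 Median reflection symmetry plus unitary translation: median(Beta(α, β)) = 1 − median(Beta(β, α))
 Mean reflection symmetry plus unitary translation: μ(Beta(α, β)) = 1 − μ(Beta(β, α))
 Geometric means: each is individually asymmetric; the following symmetry applies between the geometric mean based on X and the geometric mean based on its reflection (1 − X): G_{X}(Beta(α, β)) = G_{(1−X)}(Beta(β, α))
 Harmonic means: each is individually asymmetric; the following symmetry applies between the harmonic mean based on X and the harmonic mean based on its reflection (1 − X): H_{X}(Beta(α, β)) = H_{(1−X)}(Beta(β, α)), if α, β > 1
 Variance symmetry: var(Beta(α, β)) = var(Beta(β, α))
 Geometric variances: each is individually asymmetric; the following symmetry applies between the log geometric variance based on X and the log geometric variance based on its reflection (1 − X): ln(var_{GX}(Beta(α, β))) = ln(var_{G(1−X)}(Beta(β, α)))
 Geometric covariance symmetry: ln cov_{GX,(1−X)}(Beta(α, β)) = ln cov_{GX,(1−X)}(Beta(β, α))
 Mean absolute deviation around the mean symmetry: E[|X − E[X]|](Beta(α, β)) = E[|X − E[X]|](Beta(β, α))
 Skewness skew-symmetry: skewness(Beta(α, β)) = −skewness(Beta(β, α))
 Excess kurtosis symmetry: excess kurtosis(Beta(α, β)) = excess kurtosis(Beta(β, α))
 Characteristic function symmetry of real part (with respect to the origin of variable "t"): Re[_{1}F_{1}(α; α + β; it)] = Re[_{1}F_{1}(α; α + β; −it)]
 Characteristic function skew-symmetry of imaginary part (with respect to the origin of variable "t"): Im[_{1}F_{1}(α; α + β; it)] = −Im[_{1}F_{1}(α; α + β; −it)]
 Characteristic function symmetry of absolute value (with respect to the origin of variable "t"): Abs[_{1}F_{1}(α; α + β; it)] = Abs[_{1}F_{1}(α; α + β; −it)]
 Differential entropy symmetry: h(Beta(α, β)) = h(Beta(β, α))
 Relative entropy (also called Kullback–Leibler divergence) symmetry: D_{KL}(X_{1} || X_{2}) = D_{KL}(X_{2} || X_{1}), if h(X_{1}) = h(X_{2}), for (skewed) α ≠ β
 Fisher information matrix symmetry: I_{i,j} = I_{j,i}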
Geometry of the probability density function
Inflection points
For certain values of the shape parameters α and β, the probability density function has inflection points, at which the curvature changes sign. The position of these inflection points can be useful as a measure of the dispersion or spread of the distribution.
Defining the following quantity:
$$\kappa = \frac{\sqrt{\dfrac{(\alpha-1)(\beta-1)}{\alpha+\beta-3}}}{\alpha+\beta-2}$$
Points of inflection occur,^{[1]}^{[3]}^{[6]}^{[7]} depending on the value of the shape parameters α and β, as follows:
 (α > 2, β > 2) The distribution is bell-shaped (symmetric for α = β and skewed otherwise), with two inflection points, equidistant from the mode: x = mode ± κ
 (α = 2, β > 2) The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode: x = mode + κ = 2/β
 (α > 2, β = 2) The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode: x = mode − κ = 1 − 2/α
 (1 < α < 2, β > 2, α + β > 2) The distribution is unimodal, positively skewed, right-tailed, with one inflection point, located to the right of the mode: x = mode + κ
 (0 < α < 1, 1 < β < 2) The distribution has a mode at the left end x = 0 and it is positively skewed, right-tailed. There is one inflection point, located to the right of the mode: x = (α − 1 + √((α − 1)(β − 1)/(α + β − 3)))/(α + β − 2)
 (α > 2, 1 < β < 2) The distribution is unimodal, negatively skewed, left-tailed, with one inflection point, located to the left of the mode: x = mode − κ
 (1 < α < 2, 0 < β < 1) The distribution has a mode at the right end x = 1 and it is negatively skewed, left-tailed. There is one inflection point, located to the left of the mode: x = (α − 1 − √((α − 1)(β − 1)/(α + β − 3)))/(α + β − 2)
There are no inflection points in the remaining (symmetric and skewed) regions: U-shaped (α, β < 1), upside-down-U-shaped (1 < α < 2, 1 < β < 2), reverse-J-shaped (α < 1, β > 2) or J-shaped (α > 2, β < 1).
The accompanying plots show the inflection point locations (shown vertically, ranging from 0 to 1) versus α and β (the horizontal axes ranging from 0 to 5). There are large cuts at surfaces intersecting the lines α = 1, β = 1, α = 2, and β = 2 because at these values the beta distribution changes from 2 modes, to 1 mode, to no mode.
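For the bell-shaped case (α > 2, β > 2) the inflection points can be located and checked numerically. The sketch below assumes κ = √((α−1)(β−1)/(α+β−3))/(α+β−2), so that the inflection points lie at mode ± κ, and verifies them with a finite-difference second derivative of the unnormalized density. The function names and the step size `h` are illustrative choices.

```python
import math

def inflection_points(a, b):
    """Inflection points of Beta(a, b) in the bell-shaped case a > 2, b > 2."""
    if not (a > 2.0 and b > 2.0):
        raise ValueError("this sketch only covers a > 2 and b > 2")
    mode = (a - 1.0) / (a + b - 2.0)
    kappa = math.sqrt((a - 1.0) * (b - 1.0) / (a + b - 3.0)) / (a + b - 2.0)
    return mode - kappa, mode + kappa

def second_derivative(a, b, x, h=1e-4):
    """Central finite-difference second derivative of x^(a-1) (1-x)^(b-1)."""
    def g(t):
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / (h * h)

low, high = inflection_points(3.0, 3.0)
print(low, high)

# The curvature changes sign across each inflection point.
print(second_derivative(3.0, 3.0, low - 0.01), second_derivative(3.0, 3.0, low + 0.01))
```

For Beta(3, 3) the density is proportional to x^{2}(1 − x)^{2}, whose second derivative vanishes at 1/2 ± 1/√12, in agreement with mode ± κ.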
Shapes
The beta density function can take a wide variety of different shapes depending on the values of the two parameters α and β. The ability of the beta distribution to take this great diversity of shapes (using only two parameters) is partly responsible for finding wide application for modeling actual measurements:
Symmetric (α = β)
 the density function is symmetric about 1/2 (blue & teal plots).
 median = mean = 1/2.
 skewness = 0.
 variance = 1/(4(2α + 1))
 α = β < 1
 U-shaped (blue plot).
 bimodal: left mode = 0, right mode =1, antimode = 1/2
 1/12 < var(X) < 1/4^{[1]}
 −2 < excess kurtosis(X) < −6/5
 α = β = 1/2 is the arcsine distribution
 var(X) = 1/8
 excess kurtosis(X) = −3/2
 CF = Rinc (t) ^{[35]}
 α = β → 0 is a two-point Bernoulli distribution with equal probability 1/2 at each Dirac delta function end x = 0 and x = 1 and zero probability everywhere else. A coin toss: one face of the coin being x = 0 and the other face being x = 1.
 excess kurtosis(X) → −2 in this limit; a lower value than this is impossible for any distribution to reach.
 The differential entropy approaches a minimum value of −∞
 α = β = 1
 the uniform [0, 1] distribution
 no mode
 var(X) = 1/12
 excess kurtosis(X) = −6/5
 The (negative anywhere else) differential entropy reaches its maximum value of zero
 CF = Sinc (t)
 α = β > 1
 symmetric unimodal
 mode = 1/2.
 0 < var(X) < 1/12^{[1]}
 −6/5 < excess kurtosis(X) < 0
 α = β = 3/2 is a semi-elliptic [0, 1] distribution, see: Wigner semicircle distribution^{[36]}
 var(X) = 1/16.
 excess kurtosis(X) = −1
 CF = 2 Jinc (t)
 α = β = 2 is the parabolic [0, 1] distribution
 var(X) = 1/20
 excess kurtosis(X) = −6/7
 CF = 3 Tinc (t) ^{[37]}
 α = β > 2 is bell-shaped, with inflection points located to either side of the mode
 0 < var(X) < 1/20
 −6/7 < excess kurtosis(X) < 0
 α = β → ∞ is a one-point degenerate distribution with a Dirac delta function spike at the midpoint x = 1/2 with probability 1, and zero probability everywhere else. There is 100% probability (absolute certainty) concentrated at the single point x = 1/2.
 The differential entropy approaches a minimum value of −∞
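The variance and excess kurtosis values listed for the symmetric cases can be reproduced exactly with rational arithmetic. This sketch uses the symmetric-case variance 1/(4(2α + 1)) stated above and assumes the symmetric-case excess kurtosis −6/(2α + 3), which follows from the general formula with α = β. The function names are illustrative.

```python
from fractions import Fraction

def symmetric_variance(a):
    """Variance of Beta(a, a)."""
    return 1 / (4 * (2 * a + 1))

def symmetric_excess_kurtosis(a):
    """Excess kurtosis of Beta(a, a)."""
    return Fraction(-6) / (2 * a + 3)

cases = {
    "arcsine (1/2)": Fraction(1, 2),
    "uniform (1)": Fraction(1),
    "semi-elliptic (3/2)": Fraction(3, 2),
    "parabolic (2)": Fraction(2),
}
table = {name: (symmetric_variance(a), symmetric_excess_kurtosis(a))
         for name, a in cases.items()}
for name, (var, ek) in table.items():
    print(f"{name}: var = {var}, excess kurtosis = {ek}")
```

Using `Fraction` avoids rounding, so the output can be compared term by term with the list entries (1/8 and −3/2, 1/12 and −6/5, 1/16 and −1, 1/20 and −6/7).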
Skewed (α ≠ β)
The density function is skewed. An interchange of parameter values yields the mirror image (the reverse) of the initial curve. Some more specific cases:
 α < 1, β < 1
 U-shaped
 Positive skew for α < β, negative skew for α > β.
 bimodal: left mode = 0, right mode = 1, antimode = (α − 1)/(α + β − 2)
 0 < median < 1.
 0 < var(X) < 1/4
 α > 1, β > 1
 unimodal (magenta & cyan plots),
 Positive skew for α < β, negative skew for α > β.
 0 < median < 1
 0 < var(X) < 1/12
 α < 1, β ≥ 1
 reverse J-shaped with a right tail,
 positively skewed,
 strictly decreasing, convex
 mode = 0
 0 < median < 1/2.
 0 < var(X) < (−11 + 5√5)/2 (maximum variance occurs for α = (−1 + √5)/2, β = 1, or α = Φ the golden ratio conjugate)
 α ≥ 1, β < 1
 J-shaped with a left tail,
 negatively skewed,
 strictly increasing, convex
 mode = 1
 1/2 < median < 1
 0 < var(X) < (−11 + 5√5)/2 (maximum variance occurs for α = 1, β = (−1 + √5)/2, or β = Φ the golden ratio conjugate)
 α = 1, β > 1
 positively skewed,
 strictly decreasing (red plot),
 a reversed (mirror-image) power function [0,1] distribution
 mode = 0
 α = 1, 1 < β < 2
 concave
 1/18 < var(X) < 1/12.
 α = 1, β = 2
 a straight line with slope −2, the right-triangular distribution with right angle at the left end, at x = 0
 var(X) = 1/18
 α = 1, β > 2
 reverse J-shaped with a right tail,
 convex
 0 < var(X) < 1/18
 α > 1, β = 1
 negatively skewed,
 strictly increasing (green plot),
 the power function [0, 1] distribution^{[6]}
 mode =1
 2 > α > 1, β = 1
 concave
 1/18 < var(X) < 1/12
 α = 2, β = 1
 a straight line with slope +2, the right-triangular distribution with right angle at the right end, at x = 1
 var(X) = 1/18
 α > 2, β = 1
 J-shaped with a left tail, convex
 0 < var(X) < 1/18
Related distributions
Transformations
 If X ~ Beta(α, β) then 1 − X ~ Beta(β, α) (mirror-image symmetry).
 If X ~ Beta(α, β) then X/(1 − X) ~ β′(α, β), the beta prime distribution, also called "beta distribution of the second kind".
 If X ~ Beta(n/2, m/2) then mX/(n(1 − X)) ~ F(n, m) (assuming n > 0 and m > 0), the Fisher–Snedecor F distribution.
 If X ~ Beta(1 + λ(m − min)/(max − min), 1 + λ(max − m)/(max − min)) then min + X(max − min) ~ PERT(min, max, m, λ) where PERT denotes a PERT distribution used in PERT analysis, and m = most likely value.^{[38]} Traditionally^{[39]} λ = 4 in PERT analysis.
 If X ~ Beta(1, β) then X ~ Kumaraswamy distribution with parameters (1, β)
 If X ~ Beta(α, 1) then X ~ Kumaraswamy distribution with parameters (α, 1)
 If X ~ Beta(α, 1) then −ln(X) ~ Exponential(α)
Special and limiting cases
 Beta(1, 1) ~ U(0, 1).
 If X ~ Beta(3/2, 3/2) and r > 0 then 2rX − r ~ Wigner semicircle distribution.
 Beta(1/2, 1/2) is equivalent to the arcsine distribution. This distribution is also Jeffreys prior probability for the Bernoulli and binomial distributions. The arcsine probability density is a distribution that appears in several random-walk fundamental theorems. In a fair coin toss random walk, the probability for the time of the last visit to the origin is distributed as a (U-shaped) arcsine distribution.^{[5]}^{[15]} In a two-player fair-coin-toss game, a player is said to be in the lead if the random walk (that started at the origin) is above the origin. The most probable number of times that a given player will be in the lead, in a game of length 2N, is not N. On the contrary, N is the least likely number of times that the player will be in the lead. The most likely number of times in the lead is 0 or 2N (following the arcsine distribution).
 lim_{n → ∞} n Beta(1, n) = Exponential(1), the exponential distribution.
 lim_{n → ∞} n Beta(k, n) = Gamma(k, 1), the gamma distribution.
Derived from other distributions
 The kth order statistic of a sample of size n from the uniform distribution is a beta random variable, U_{(k)} ~ Beta(k, n+1−k).^{[40]}
 If X ~ Gamma(α, θ) and Y ~ Gamma(β, θ) are independent, then X/(X + Y) ~ Beta(α, β).
 If X ~ χ^{2}(α) and Y ~ χ^{2}(β) are independent, then X/(X + Y) ~ Beta(α/2, β/2).
 If X ~ U(0, 1) and α > 0 then X^{1/α} ~ Beta(α, 1). The power function distribution.
 If , then for discrete values of n and k where and .^{[41]}
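Several of the constructions in this list can be illustrated by simulation with the standard library. The sketch below draws samples via the gamma-ratio construction, the kth order statistic of uniform samples, and the power transformation X^{1/α} of a uniform variable, then compares the sample means with the theoretical mean α/(α + β) of the corresponding beta distribution. The seed, sample size and parameter values are arbitrary illustrative choices.

```python
import random

rng = random.Random(2024)   # fixed seed so the run is repeatable
N = 50_000

def mean(values):
    return sum(values) / len(values)

# Gamma ratio: X ~ Gamma(2, 1), Y ~ Gamma(5, 1)  =>  X / (X + Y) ~ Beta(2, 5)
ratio = []
for _ in range(N):
    x = rng.gammavariate(2.0, 1.0)
    y = rng.gammavariate(5.0, 1.0)
    ratio.append(x / (x + y))

# Order statistic: 2nd smallest of 5 uniforms ~ Beta(2, 5 + 1 - 2) = Beta(2, 4)
n, k = 5, 2
order = [sorted(rng.random() for _ in range(n))[k - 1] for _ in range(N)]

# Power function: U^(1/3) ~ Beta(3, 1) for U ~ U(0, 1)
power = [rng.random() ** (1.0 / 3.0) for _ in range(N)]

print(mean(ratio), 2 / 7)    # Beta(2, 5) mean
print(mean(order), 2 / 6)    # Beta(2, 4) mean
print(mean(power), 3 / 4)    # Beta(3, 1) mean
```

Agreement of the means is only a coarse check; comparing higher moments or empirical quantiles would give a more thorough confirmation.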
Combination with other distributions
 If X ~ Beta(α, β) and Y ~ F(2β, 2α) then Pr(X ≤ α/(α + βx)) = Pr(Y ≥ x) for all x > 0.
Compounding with other distributions
 If p ~ Beta(α, β) and X ~ Bin(k, p) then X ~ beta-binomial distribution
 If p ~ Beta(α, β) and X ~ NB(r, p) then X ~ beta negative binomial distribution
Generalisations
 The generalization to multiple variables, i.e. a multivariate Beta distribution, is called a Dirichlet distribution. Univariate marginals of the Dirichlet distribution have a beta distribution. The beta distribution is conjugate to the binomial and Bernoulli distributions in exactly the same way as the Dirichlet distribution is conjugate to the multinomial distribution and categorical distribution.
 The Pearson type I distribution is identical to the beta distribution (except for arbitrary shifting and rescaling that can also be accomplished with the four parameter parametrization of the beta distribution).
 the noncentral beta distribution
 The generalized beta distribution is a five-parameter distribution family which has the beta distribution as a special case.
 The matrix variate beta distribution is a distribution for positive-definite matrices.
Statistical Inference
Parameter estimation
Method of moments
Two unknown parameters
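Two unknown parameters ((α̂, β̂) of a beta distribution supported in the [0,1] interval) can be estimated, using the method of moments, with the first two moments (sample mean and sample variance) as follows. Let:
$$\text{sample mean}(X) = \bar{x} = \frac{1}{N}\sum_{i=1}^{N}X_i$$
be the sample mean estimate and
$$\text{sample variance}(X) = \bar{v} = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{x})^2$$
be the sample variance estimate. The method-of-moments estimates of the parameters are
 α̂ = x̄ (x̄(1 − x̄)/v̄ − 1), if v̄ < x̄(1 − x̄),
 β̂ = (1 − x̄)(x̄(1 − x̄)/v̄ − 1), if v̄ < x̄(1 − x̄).
When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace x̄ with (ȳ − a)/(c − a) and v̄ with v̄_{Y}/(c − a)^{2} in the above couple of equations for the shape parameters (see the "Alternative parametrizations, four parameters" section below),^{[42]} where:
$$\text{sample mean}(Y) = \bar{y} = \frac{1}{N}\sum_{i=1}^{N}Y_i, \qquad \text{sample variance}(Y) = \bar{v}_Y = \frac{1}{N-1}\sum_{i=1}^{N}(Y_i-\bar{y})^2$$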
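The two-parameter method-of-moments estimates translate directly into code. The following is a minimal sketch using only the standard library; the function names and the optional `a`, `c` support arguments are illustrative, and the simulated sample exists only to exercise the estimator.

```python
import random
import statistics

def beta_mom_from_moments(mean, var):
    """Method-of-moments shape estimates from a mean and variance on [0, 1]."""
    if not (0.0 < mean < 1.0) or not (0.0 < var < mean * (1.0 - mean)):
        raise ValueError("requires 0 < mean < 1 and 0 < var < mean * (1 - mean)")
    common = mean * (1.0 - mean) / var - 1.0
    return mean * common, (1.0 - mean) * common

def beta_mom(sample, a=0.0, c=1.0):
    """Estimate (alpha, beta) from data supported on the known interval [a, c]."""
    scaled_mean = (statistics.fmean(sample) - a) / (c - a)
    scaled_var = statistics.variance(sample) / (c - a) ** 2   # N - 1 denominator
    return beta_mom_from_moments(scaled_mean, scaled_var)

# Exact population moments of Beta(2, 5) recover the parameters exactly.
print(beta_mom_from_moments(2.0 / 7.0, 10.0 / 392.0))

# A simulated sample gives estimates close to (2, 5).
rng = random.Random(7)
sample = [rng.betavariate(2.0, 5.0) for _ in range(20_000)]
alpha_hat, beta_hat = beta_mom(sample)
print(alpha_hat, beta_hat)
```

The guard on the variance mirrors the condition v̄ < x̄(1 − x̄): without it the estimates would be negative or undefined.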
Four unknown parameters
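All four parameters (α̂, β̂, â, ĉ) of a beta distribution supported in the [a, c] interval (see section "Alternative parametrizations, Four parameters") can be estimated, using the method of moments developed by Karl Pearson, by equating sample and population values of the first four central moments (mean, variance, skewness and excess kurtosis).^{[1]}^{[43]}^{[44]} The excess kurtosis was expressed in terms of the square of the skewness, and the sample size ν = α + β, (see previous section "Kurtosis") as follows:
$$\text{excess kurtosis} = \frac{6}{3+\nu}\left(\frac{2+\nu}{4}(\text{skewness})^2 - 1\right), \text{ if } (\text{skewness})^2 - 2 < \text{excess kurtosis} < \tfrac{3}{2}(\text{skewness})^2$$
One can use this equation to solve for the sample size ν = α + β in terms of the square of the skewness and the excess kurtosis as follows:^{[43]}
$$\hat{\nu} = \hat{\alpha}+\hat{\beta} = 3\,\frac{(\text{sample excess kurtosis}) - (\text{sample skewness})^2 + 2}{\tfrac{3}{2}(\text{sample skewness})^2 - (\text{sample excess kurtosis})}$$
This is the ratio (multiplied by a factor of 3) between the previously derived limit boundaries for the beta distribution in a space (as originally done by Karl Pearson^{[23]}) defined with coordinates of the square of the skewness in one axis and the excess kurtosis in the other axis (see previous section titled "Kurtosis bounded by the square of the skewness"):
$$(\text{skewness})^2 - 2 < \text{excess kurtosis} < \tfrac{3}{2}(\text{skewness})^2$$
The case of zero skewness can be immediately solved because for zero skewness, α = β and hence ν = 2α = 2β, therefore α = β = ν/2:
$$\hat{\alpha} = \hat{\beta} = \frac{\hat{\nu}}{2} = \frac{\tfrac{3}{2}(\text{sample excess kurtosis}) + 3}{-(\text{sample excess kurtosis})}, \text{ if sample skewness} = 0 \text{ and } -2 < \text{sample excess kurtosis} < 0$$
(Excess kurtosis is negative for the beta distribution with zero skewness, ranging from −2 to 0, so that ν̂, and therefore the sample shape parameters, is positive, ranging from zero when the shape parameters approach zero and the excess kurtosis approaches −2, to infinity when the shape parameters approach infinity and the excess kurtosis approaches zero).
For non-zero sample skewness one needs to solve a system of two coupled equations. Since the skewness and the excess kurtosis are independent of the parameters â, ĉ, the parameters α̂, β̂ can be uniquely determined from the sample skewness and the sample excess kurtosis, by solving the coupled equations with two known variables (sample skewness and sample excess kurtosis) and two unknowns (the shape parameters):
$$(\text{sample skewness})^2 = \frac{4(\hat{\beta}-\hat{\alpha})^2(1+\hat{\alpha}+\hat{\beta})}{\hat{\alpha}\hat{\beta}(2+\hat{\alpha}+\hat{\beta})^2}$$
$$\text{sample excess kurtosis} = \frac{6}{3+\hat{\alpha}+\hat{\beta}}\left(\frac{2+\hat{\alpha}+\hat{\beta}}{4}(\text{sample skewness})^2 - 1\right)$$
resulting in the following solution:^{[43]}
$$\hat{\alpha},\hat{\beta} = \frac{\hat{\nu}}{2}\left(1 \pm \frac{1}{\sqrt{1+\dfrac{16(\hat{\nu}+1)}{(\hat{\nu}+2)^2(\text{sample skewness})^2}}}\right), \text{ if sample skewness} \neq 0$$
Where one should take the solutions as follows: α̂ > β̂ for (negative) sample skewness < 0, and α̂ < β̂ for (positive) sample skewness > 0.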
The accompanying plot shows these two solutions as surfaces in a space with horizontal axes of (sample excess kurtosis) and (sample squared skewness) and the shape parameters as the vertical axis. The surfaces are constrained by the condition that the sample excess kurtosis must be bounded by the sample squared skewness as stipulated in the above equation. The two surfaces meet at the right edge defined by zero skewness. Along this right edge, both parameters are equal and the distribution is symmetric Ushaped for α = β < 1, uniform for α = β = 1, upsidedownUshaped for 1 < α = β < 2 and bellshaped for α = β > 2. The surfaces also meet at the front (lower) edge defined by "the impossible boundary" line (excess kurtosis + 2  skewness^{2} = 0). Along this front (lower) boundary both shape parameters approach zero, and the probability density is concentrated more at one end than the other end (with practically nothing in between), with probabilities at the left end x = 0 and at the right end x = 1. The two surfaces become further apart towards the rear edge. At this rear edge the surface parameters are quite different from each other. As remarked, for example, by Bowman and Shenton,^{[45]} sampling in the neighborhood of the line (sample excess kurtosis  (3/2)(sample skewness)^{2} = 0) (the justJshaped portion of the rear edge where blue meets beige), "is dangerously near to chaos", because at that line the denominator of the expression above for the estimate ν = α + β becomes zero and hence ν approaches infinity as that line is approached. Bowman and Shenton ^{[45]} write that "the higher moment parameters (kurtosis and skewness) are extremely fragile (near that line). However the mean and standard deviation are fairly reliable." Therefore, the problem is for the case of four parameter estimation for very skewed distributions such that the excess kurtosis approaches (3/2) times the square of the skewness. 
This boundary line is produced by extremely skewed distributions with very large values of one of the parameters and very small values of the other parameter. See section titled "Kurtosis bounded by the square of the skewness" for a numerical example and further comments about this rear edge boundary line (sample excess kurtosis − (3/2)(sample skewness)^{2} = 0). As remarked by Karl Pearson himself,^{[46]} this issue may not be of much practical importance, as this trouble arises only for very skewed J-shaped (or mirror-image J-shaped) distributions with very different values of shape parameters that are unlikely to occur much in practice. The usual skewed-bell-shape distributions that occur in practice do not have this parameter estimation problem.
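The remaining two parameters $\hat{a}, \hat{c}$ can be determined using the sample mean and the sample variance using a variety of equations.^{[1]}^{[43]} One alternative is to calculate the support interval range $(\hat{c} - \hat{a})$ based on the sample variance and the sample kurtosis. For this purpose one can solve, in terms of the range $(\hat{c} - \hat{a})$, the equation expressing the excess kurtosis in terms of the sample variance and the sample size $\hat{\nu} = \hat{\alpha} + \hat{\beta}$ (see section titled "Kurtosis" and "Alternative parametrizations, four parameters"):
$$\text{sample excess kurtosis} = \frac{6}{(3+\hat{\nu})(2+\hat{\nu})}\left(\frac{(\hat{c}-\hat{a})^2}{\text{sample variance}} - 6 - 5\hat{\nu}\right)$$
to obtain:
$$(\hat{c}-\hat{a}) = \sqrt{\text{sample variance}}\,\sqrt{6 + 5\hat{\nu} + \frac{(2+\hat{\nu})(3+\hat{\nu})}{6}(\text{sample excess kurtosis})}$$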
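Another alternative is to calculate the support interval range $(\hat{c} - \hat{a})$ based on the sample variance and the sample skewness.^{[43]} For this purpose one can solve, in terms of the range $(\hat{c} - \hat{a})$, the equation expressing the squared skewness in terms of the sample variance and the sample size $\hat{\nu} = \hat{\alpha} + \hat{\beta}$ (see section titled "Skewness" and "Alternative parametrizations, four parameters"):
$$(\text{sample skewness})^2 = \frac{4}{(2+\hat{\nu})^2}\left(\frac{(\hat{c}-\hat{a})^2}{\text{sample variance}} - 4(1+\hat{\nu})\right)$$
to obtain:^{[43]}
$$(\hat{c}-\hat{a}) = \frac{\sqrt{\text{sample variance}}}{2}\,\sqrt{(2+\hat{\nu})^2(\text{sample skewness})^2 + 16(1+\hat{\nu})}$$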
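The remaining parameter $\hat{a}$ can be determined from the sample mean and the previously obtained parameters $(\hat{c}-\hat{a})$, $\hat{\alpha}$ and $\hat{\nu} = \hat{\alpha} + \hat{\beta}$:
$$\hat{a} = (\text{sample mean}) - \frac{\hat{\alpha}}{\hat{\nu}}\,(\hat{c}-\hat{a})$$
and finally, of course, $\hat{c} = (\hat{c} - \hat{a}) + \hat{a}$.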
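The recovery of the support endpoints can be checked numerically. The following is a minimal sketch, not a reference implementation: the function name `support_from_moments` is hypothetical, and the shape estimates (here α and ν = α + β) are assumed to have been obtained already from the skewness and kurtosis. It feeds in the exact moments of a Beta(2, 3) distribution on [1, 5] and recovers the range both from the kurtosis and from the skewness.

```python
import math

def support_from_moments(mean, var, skew, ex_kurt, alpha, nu):
    """Recover (a, c) from moments, given shape estimates alpha and nu = alpha + beta.

    Hypothetical helper: the argument names are illustrative only.
    """
    # Range (c - a) from the variance and the excess kurtosis.
    range_kurt = math.sqrt(var) * math.sqrt(
        6 + 5 * nu + (2 + nu) * (3 + nu) / 6 * ex_kurt)
    # Range (c - a) from the variance and the skewness (alternative route).
    range_skew = math.sqrt(var) / 2 * math.sqrt(
        (2 + nu) ** 2 * skew ** 2 + 16 * (1 + nu))
    # Location from the mean, then the upper endpoint.
    a = mean - alpha / nu * range_kurt
    c = a + range_kurt
    return a, c, range_kurt, range_skew

# Exact population moments of Beta(alpha=2, beta=3) rescaled to [1, 5].
alpha, beta, a, c = 2.0, 3.0, 1.0, 5.0
nu = alpha + beta
mean = a + (c - a) * alpha / nu
var = (c - a) ** 2 * alpha * beta / (nu ** 2 * (nu + 1))
skew = 2 * (beta - alpha) * math.sqrt(nu + 1) / ((nu + 2) * math.sqrt(alpha * beta))
ex_kurt = (6 * ((alpha - beta) ** 2 * (nu + 1) - alpha * beta * (nu + 2))
           / (alpha * beta * (nu + 2) * (nu + 3)))

print(support_from_moments(mean, var, skew, ex_kurt, alpha, nu))
```

With exact moments both routes should return a range close to 4 and endpoints close to 1 and 5; with sample moments the two ranges will generally differ.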
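In the above formulas one may take, for example, as estimates of the sample moments:
$$\text{sample mean} = \overline{y} = \frac{1}{N}\sum_{i=1}^N Y_i$$
$$\text{sample variance} = \overline{v}_Y = \frac{1}{N-1}\sum_{i=1}^N (Y_i - \overline{y})^2$$
$$\text{sample skewness} = G_1 = \frac{N}{(N-1)(N-2)}\,\frac{\sum_{i=1}^N (Y_i - \overline{y})^3}{\overline{v}_Y^{3/2}}$$
$$\text{sample excess kurtosis} = G_2 = \frac{N(N+1)}{(N-1)(N-2)(N-3)}\,\frac{\sum_{i=1}^N (Y_i - \overline{y})^4}{\overline{v}_Y^{2}} - \frac{3(N-1)^2}{(N-2)(N-3)}$$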
The estimators G_{1} for sample skewness and G_{2} for sample kurtosis are used by DAP/SAS, PSPP/SPSS, and Excel. However, they are not used by BMDP and (according to Joanes and Gill^{[47]}) they were not used by MINITAB in 1998. Joanes and Gill, in their 1998 study,^{[47]} concluded that the skewness and kurtosis estimators used in BMDP and in MINITAB (at that time) had smaller variance and mean-squared error in normal samples, but the skewness and kurtosis estimators used in DAP/SAS and PSPP/SPSS, namely G_{1} and G_{2}, had smaller mean-squared error in samples from a very skewed distribution. It is for this reason that "sample skewness", etc., are spelled out in the above formulas, to make it explicit that the user should choose the best estimator according to the problem at hand, as the best estimator for skewness and kurtosis depends on the amount of skewness (as shown by Joanes and Gill^{[47]}).
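As an illustration of the G_{1} and G_{2} estimators named above, here is a minimal sketch using only the Python standard library. The function name `sample_moments` is hypothetical; the formulas are the standard bias-adjusted sample skewness and excess kurtosis, built on the unbiased sample variance.

```python
def sample_moments(ys):
    """Return (mean, variance, G1, G2) for a sample of at least 4 values."""
    n = len(ys)
    if n < 4:
        raise ValueError("G2 requires at least 4 observations")
    mean = sum(ys) / n
    dev = [y - mean for y in ys]
    var = sum(d ** 2 for d in dev) / (n - 1)      # unbiased sample variance
    sum3 = sum(d ** 3 for d in dev)               # sum of cubed deviations
    sum4 = sum(d ** 4 for d in dev)               # sum of fourth-power deviations
    g1 = n / ((n - 1) * (n - 2)) * sum3 / var ** 1.5
    g2 = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum4 / var ** 2
          - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    return mean, var, g1, g2

print(sample_moments([1.0, 2.0, 3.0, 4.0, 5.0]))
```

For the symmetric sample 1, ..., 5 the skewness G_{1} is zero and the excess kurtosis G_{2} is negative, as expected for evenly spread data.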
Maximum likelihood
Two unknown parameters
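As is also the case for maximum likelihood estimates for the gamma distribution, the maximum likelihood estimates for the beta distribution do not have a general closed form solution for arbitrary values of the shape parameters. If X_{1}, ..., X_{N} are independent random variables each having a beta distribution, the joint log likelihood function for N iid observations is:
$$\ln \mathcal{L}(\alpha,\beta\mid X) = \sum_{i=1}^N \ln f(X_i;\alpha,\beta) = (\alpha-1)\sum_{i=1}^N \ln X_i + (\beta-1)\sum_{i=1}^N \ln(1-X_i) - N\ln \mathrm{B}(\alpha,\beta)$$
Finding the maximum with respect to a shape parameter involves taking the partial derivative with respect to the shape parameter and setting the expression equal to zero, yielding the maximum likelihood estimator of the shape parameters:
$$\frac{\partial \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\alpha} = \sum_{i=1}^N \ln X_i - N\frac{\partial \ln \mathrm{B}(\alpha,\beta)}{\partial\alpha} = 0$$
$$\frac{\partial \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\beta} = \sum_{i=1}^N \ln(1-X_i) - N\frac{\partial \ln \mathrm{B}(\alpha,\beta)}{\partial\beta} = 0$$
where:
$$\frac{\partial \ln \mathrm{B}(\alpha,\beta)}{\partial\alpha} = \frac{\partial \ln\Gamma(\alpha)}{\partial\alpha} - \frac{\partial \ln\Gamma(\alpha+\beta)}{\partial\alpha} = \psi(\alpha) - \psi(\alpha+\beta)$$
$$\frac{\partial \ln \mathrm{B}(\alpha,\beta)}{\partial\beta} = \frac{\partial \ln\Gamma(\beta)}{\partial\beta} - \frac{\partial \ln\Gamma(\alpha+\beta)}{\partial\beta} = \psi(\beta) - \psi(\alpha+\beta)$$
since the digamma function denoted ψ(α) is defined as the logarithmic derivative of the gamma function:^{[21]}
$$\psi(\alpha) = \frac{\partial \ln\Gamma(\alpha)}{\partial\alpha}$$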
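To ensure that the values with zero tangent slope are indeed a maximum (instead of a saddle-point or a minimum) one has to also satisfy the condition that the curvature is negative. This amounts to requiring that the second partial derivatives with respect to the shape parameters be negative:
$$\frac{\partial^2 \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\alpha^2} = -N\frac{\partial^2 \ln \mathrm{B}(\alpha,\beta)}{\partial\alpha^2} < 0, \qquad \frac{\partial^2 \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\beta^2} = -N\frac{\partial^2 \ln \mathrm{B}(\alpha,\beta)}{\partial\beta^2} < 0$$
Using the previous equations, this is equivalent to:
$$\frac{\partial^2 \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\alpha^2} = -N\left(\psi_1(\alpha) - \psi_1(\alpha+\beta)\right) < 0, \qquad \frac{\partial^2 \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\beta^2} = -N\left(\psi_1(\beta) - \psi_1(\alpha+\beta)\right) < 0$$
where the trigamma function, denoted ψ_{1}(α), is the second of the polygamma functions, and is defined as the derivative of the digamma function:
$$\psi_1(\alpha) = \frac{\partial^2 \ln\Gamma(\alpha)}{\partial\alpha^2} = \frac{\partial \psi(\alpha)}{\partial\alpha}$$
These conditions are equivalent to stating that the variances of the logarithmically transformed variables are positive, since:
$$\operatorname{var}[\ln X] = \operatorname{E}[\ln^2 X] - (\operatorname{E}[\ln X])^2 = \psi_1(\alpha) - \psi_1(\alpha+\beta)$$
$$\operatorname{var}[\ln(1-X)] = \operatorname{E}[\ln^2(1-X)] - (\operatorname{E}[\ln(1-X)])^2 = \psi_1(\beta) - \psi_1(\alpha+\beta)$$
Therefore, the condition of negative curvature at a maximum is equivalent to the statements:
$$\operatorname{var}[\ln X] > 0, \qquad \operatorname{var}[\ln(1-X)] > 0$$
Alternatively, the condition of negative curvature at a maximum is also equivalent to stating that the following logarithmic derivatives of the geometric means G_{X} and G_{(1−X)} are positive, since:
$$\psi_1(\alpha) - \psi_1(\alpha+\beta) = \frac{\partial \ln G_X}{\partial\alpha} > 0, \qquad \psi_1(\beta) - \psi_1(\alpha+\beta) = \frac{\partial \ln G_{(1-X)}}{\partial\beta} > 0$$
While these slopes are indeed positive, the other slopes are negative:
$$\frac{\partial \ln G_X}{\partial\beta} < 0, \qquad \frac{\partial \ln G_{(1-X)}}{\partial\alpha} < 0$$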
The slopes of the mean and the median with respect to α and β display similar sign behavior.
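From the condition that at a maximum, the partial derivative with respect to the shape parameter equals zero, we obtain the following system of coupled maximum likelihood estimate equations (for the average log-likelihoods) that needs to be inverted to obtain the (unknown) shape parameter estimates $\hat{\alpha}, \hat{\beta}$ in terms of the (known) average of logarithms of the samples X_{1}, ..., X_{N}:^{[1]}
$$\psi(\hat{\alpha}) - \psi(\hat{\alpha}+\hat{\beta}) = \frac{1}{N}\sum_{i=1}^N \ln X_i = \ln \hat{G}_X$$
$$\psi(\hat{\beta}) - \psi(\hat{\alpha}+\hat{\beta}) = \frac{1}{N}\sum_{i=1}^N \ln(1-X_i) = \ln \hat{G}_{(1-X)}$$
where we recognize $\ln \hat{G}_X$ as the logarithm of the sample geometric mean and $\ln \hat{G}_{(1-X)}$ as the logarithm of the sample geometric mean based on (1 − X), the mirror-image of X. For $\hat{\alpha} = \hat{\beta}$, it follows that $\hat{G}_X = \hat{G}_{(1-X)}$. The sample geometric means are:
$$\hat{G}_X = \prod_{i=1}^N (X_i)^{1/N}, \qquad \hat{G}_{(1-X)} = \prod_{i=1}^N (1-X_i)^{1/N}$$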
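These coupled equations containing digamma functions of the shape parameter estimates $\hat{\alpha}, \hat{\beta}$ must be solved by numerical methods as done, for example, by Beckman et al.^{[48]} Gnanadesikan et al. give numerical solutions for a few cases.^{[49]} N. L. Johnson and S. Kotz^{[1]} suggest that for "not too small" shape parameter estimates $\hat{\alpha}$ and $\hat{\beta}$, the logarithmic approximation to the digamma function $\psi(\hat{\alpha}) \approx \ln(\hat{\alpha} - \tfrac{1}{2})$ may be used to obtain initial values for an iterative solution, since the equations resulting from this approximation can be solved exactly:
$$\ln\frac{\hat{\alpha} - \frac{1}{2}}{\hat{\alpha}+\hat{\beta} - \frac{1}{2}} \approx \ln \hat{G}_X, \qquad \ln\frac{\hat{\beta} - \frac{1}{2}}{\hat{\alpha}+\hat{\beta} - \frac{1}{2}} \approx \ln \hat{G}_{(1-X)}$$
which leads to the following solution for the initial values (of the estimated shape parameters in terms of the sample geometric means) for an iterative solution:
$$\hat{\alpha} \approx \frac{1}{2} + \frac{\hat{G}_X}{2\left(1 - \hat{G}_X - \hat{G}_{(1-X)}\right)} \text{ if } \hat{\alpha} > 1, \qquad \hat{\beta} \approx \frac{1}{2} + \frac{\hat{G}_{(1-X)}}{2\left(1 - \hat{G}_X - \hat{G}_{(1-X)}\right)} \text{ if } \hat{\beta} > 1$$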
Alternatively, the estimates provided by the method of moments can instead be used as initial values for an iterative solution of the maximum likelihood coupled equations in terms of the digamma functions.
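One possible numerical scheme for the coupled digamma equations is sketched below, under stated assumptions: Newton's method is my choice here (the sources cited above are not claimed to use it), the starting point is the logarithmic-approximation solution described in the text, and `digamma`, `trigamma`, `log_geometric_means` and `beta_mle` are hypothetical helper names. The digamma and trigamma helpers use the standard recurrence plus asymptotic series so that the block needs only the standard library; a library routine could be substituted.

```python
import math

def digamma(x):
    """psi(x) for x > 0: upward recurrence, then the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def trigamma(x):
    """psi_1(x) for x > 0: upward recurrence, then the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * inv2 + inv2 / x * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def log_geometric_means(xs):
    """Return (ln G_X, ln G_(1-X)) for observations strictly inside (0, 1)."""
    n = len(xs)
    return (sum(math.log(x) for x in xs) / n,
            sum(math.log(1.0 - x) for x in xs) / n)

def beta_mle(ln_gx, ln_g1, tol=1e-10, max_iter=100):
    """Solve psi(a) - psi(a+b) = ln_gx and psi(b) - psi(a+b) = ln_g1 by Newton steps."""
    gx, g1 = math.exp(ln_gx), math.exp(ln_g1)
    denom = 2.0 * (1.0 - gx - g1)
    a, b = 0.5 + gx / denom, 0.5 + g1 / denom   # logarithmic-approximation start
    for _ in range(max_iter):
        f1 = digamma(a) - digamma(a + b) - ln_gx
        f2 = digamma(b) - digamma(a + b) - ln_g1
        t = trigamma(a + b)
        j11, j22, j12 = trigamma(a) - t, trigamma(b) - t, -t   # symmetric Jacobian
        det = j11 * j22 - j12 * j12
        da = (j22 * f1 - j12 * f2) / det
        db = (j11 * f2 - j12 * f1) / det
        step = 1.0
        while a - step * da <= 0.0 or b - step * db <= 0.0:
            step *= 0.5                         # keep both estimates positive
        a, b = a - step * da, b - step * db
        if abs(da) + abs(db) < tol:
            break
    return a, b

ln_gx, ln_g1 = log_geometric_means([0.2, 0.4, 0.5, 0.7])
print(beta_mle(ln_gx, ln_g1))
```

Only the two averaged logarithms enter the solver, which reflects the fact that the sample geometric means are sufficient statistics for this problem.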
When the distribution is required over a known interval other than [0, 1] with random variable X, say [a, c] with random variable Y, then replace ln(X_{i}) in the first equation with
$$\ln\frac{Y_i - a}{c - a},$$
and replace ln(1−X_{i}) in the second equation with
$$\ln\frac{c - Y_i}{c - a}$$
(see "Alternative parametrizations, four parameters" section below).
If one of the shape parameters is known, the problem is considerably simplified. The following logit transformation can be used to solve for the unknown shape parameter (for skewed cases such that $\hat{\alpha} \neq \hat{\beta}$; otherwise, if symmetric, both equal parameters are known when one is known):
$$\psi(\hat{\alpha}) - \psi(\hat{\beta}) = \frac{1}{N}\sum_{i=1}^N \ln\frac{X_i}{1-X_i} = \ln \hat{G}_X - \ln \hat{G}_{(1-X)}$$
This logit transformation is the logarithm of the transformation that divides the variable X by its mirror-image (X/(1 − X)), resulting in the "inverted beta distribution" or beta prime distribution (also known as beta distribution of the second kind or Pearson's Type VI) with support [0, +∞). As previously discussed in the section "Moments of logarithmically transformed random variables," the logit transformation $\ln\frac{X}{1-X}$, studied by Johnson,^{[27]} extends the finite support [0, 1] based on the original variable X to infinite support in both directions of the real line (−∞, +∞).
If, for example, $\hat{\beta}$ is known, the unknown parameter $\hat{\alpha}$ can be obtained in terms of the inverse^{[50]} digamma function of the right hand side of this equation:
$$\hat{\alpha} = \psi^{-1}\!\left(\frac{1}{N}\sum_{i=1}^N \ln\frac{X_i}{1-X_i} + \psi(\hat{\beta})\right)$$
In particular, if one of the shape parameters has a value of unity, for example for $\hat{\beta} = 1$ (the power function distribution with bounded support [0, 1]), using the identity ψ(x + 1) = ψ(x) + 1/x in the equation $\psi(\hat{\alpha}) - \psi(\hat{\alpha}+\hat{\beta}) = \ln \hat{G}_X$, the maximum likelihood estimator for the unknown parameter $\hat{\alpha}$ is,^{[1]} exactly:
$$\hat{\alpha} = -\frac{1}{\frac{1}{N}\sum_{i=1}^N \ln X_i} = -\frac{1}{\ln \hat{G}_X}$$
The beta has support [0, 1], therefore $\hat{G}_X < 1$, and hence $(-\ln \hat{G}_X) > 0$, and therefore $\hat{\alpha} > 0$.
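The closed-form estimator for the case β = 1 is short enough to state directly in code. The sketch below assumes observations strictly inside (0, 1); the function name `power_function_alpha_mle` is hypothetical.

```python
import math

def power_function_alpha_mle(xs):
    """MLE of alpha for Beta(alpha, 1): minus the reciprocal of the mean log."""
    mean_log = sum(math.log(x) for x in xs) / len(xs)   # equals ln of the geometric mean
    return -1.0 / mean_log

# Three observations whose logarithms are -1, -2 and -3 (mean log = -2).
sample = [math.exp(-1.0), math.exp(-2.0), math.exp(-3.0)]
print(power_function_alpha_mle(sample))
```

Because every observation is below 1, the mean logarithm is negative and the estimate is positive, matching the argument above.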
In conclusion, the maximum likelihood estimates of the shape parameters of a beta distribution are (in general) a complicated function of the sample geometric mean, and of the sample geometric mean based on (1−X), the mirror-image of X. One may ask, if the variance (in addition to the mean) is necessary to estimate two shape parameters with the method of moments, why is the (logarithmic or geometric) variance not necessary to estimate two shape parameters with the maximum likelihood method, for which only the geometric means suffice? The answer is that the mean does not provide as much information as the geometric mean. For a beta distribution with equal shape parameters α = β, the mean is exactly 1/2, regardless of the value of the shape parameters, and therefore regardless of the value of the statistical dispersion (the variance). On the other hand, the geometric mean of a beta distribution with equal shape parameters α = β depends on the value of the shape parameters, and therefore it contains more information. Also, the geometric mean of a beta distribution does not satisfy the symmetry conditions satisfied by the mean; therefore, by employing both the geometric mean based on X and the geometric mean based on (1 − X), the maximum likelihood method is able to provide best estimates for both parameters α = β, without need of employing the variance.
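One can express the joint log likelihood per N iid observations in terms of the sufficient statistics (the sample geometric means) as follows:
$$\frac{\ln \mathcal{L}(\alpha,\beta\mid X)}{N} = (\alpha-1)\ln \hat{G}_X + (\beta-1)\ln \hat{G}_{(1-X)} - \ln \mathrm{B}(\alpha,\beta)$$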
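The per-observation log likelihood, written as (α − 1) ln Ĝ_X + (β − 1) ln Ĝ_(1−X) − ln B(α, β), can be evaluated with nothing more than the log-gamma function. The following is a small sketch; the name `avg_log_likelihood` is hypothetical, and the two log geometric means used in the example are simply the exact values for a Beta(2, 3) distribution, chosen so the location of the maximum is known in advance.

```python
import math

def avg_log_likelihood(alpha, beta, ln_gx, ln_g1):
    """Joint log likelihood per observation from the two log geometric means."""
    ln_beta_fn = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1.0) * ln_gx + (beta - 1.0) * ln_g1 - ln_beta_fn

# Exact log geometric means of Beta(2, 3): psi(2) - psi(5) and psi(3) - psi(5).
ln_gx, ln_g1 = -13.0 / 12.0, -7.0 / 12.0

for a, b in [(1.0, 1.0), (2.0, 3.0), (2.5, 3.0), (2.0, 3.5)]:
    print(a, b, avg_log_likelihood(a, b, ln_gx, ln_g1))
```

At α = β = 1 the value is zero whatever the geometric means are, since both linear terms vanish and ln B(1, 1) = 0.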
We can plot the joint log likelihood per N observations for fixed values of the sample geometric means to see the behavior of the likelihood function as a function of the shape parameters α and β. In such a plot, the shape parameter estimators $\hat{\alpha}, \hat{\beta}$ correspond to the maxima of the likelihood function. See the accompanying graph that shows that all the likelihood functions intersect at α = β = 1, which corresponds to the values of the shape parameters that give the maximum entropy (the maximum entropy occurs for shape parameters equal to unity: the uniform distribution). It is evident from the plot that the likelihood function gives sharp peaks for values of the shape parameter estimators close to zero, but that for values of the shape parameter estimators greater than one, the likelihood function becomes quite flat, with less defined peaks. Consequently, the maximum likelihood parameter estimation method for the beta distribution becomes less acceptable for larger values of the shape parameter estimators, as the uncertainty in the peak definition increases with the value of the shape parameter estimators. One can arrive at the same conclusion by noticing that the expression for the curvature of the likelihood function is in terms of the geometric variances:
$$\frac{1}{N}\frac{\partial^2 \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\alpha^2} = -\operatorname{var}[\ln X], \qquad \frac{1}{N}\frac{\partial^2 \ln \mathcal{L}(\alpha,\beta\mid X)}{\partial\beta^2} = -\operatorname{var}[\ln(1-X)]$$
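These variances (and therefore the curvatures) are much larger for small values of the shape parameters α and β. However, for shape parameter values α, β > 1, the variances (and therefore the curvatures) flatten out. Equivalently, this result follows from the Cramér–Rao bound, since the Fisher information matrix components for the beta distribution are these logarithmic variances. The Cramér–Rao bound states that the variance of any unbiased estimator $\hat{\alpha}$ of α is bounded by the reciprocal of the Fisher information:
$$\operatorname{var}(\hat{\alpha}) \geq \frac{1}{\operatorname{var}[\ln X]} = \frac{1}{\psi_1(\alpha) - \psi_1(\alpha+\beta)}, \qquad \operatorname{var}(\hat{\beta}) \geq \frac{1}{\operatorname{var}[\ln(1-X)]} = \frac{1}{\psi_1(\beta) - \psi_1(\alpha+\beta)}$$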
so the variance of the estimators increases with increasing α and β, as the logarithmic variances decrease.
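The growth of the bound with the shape parameters can be tabulated directly. The sketch below evaluates the reciprocal of the logarithmic variance ψ_{1}(α) − ψ_{1}(α + β) for a single observation, treating each parameter separately as in the text (the off-diagonal term of the information matrix is ignored, which is an assumption of this illustration). The helper names `trigamma` and `scalar_cramer_rao_bounds` are hypothetical, and the trigamma function is approximated by the standard recurrence plus asymptotic series.

```python
import math

def trigamma(x):
    """psi_1(x) for x > 0: upward recurrence, then the asymptotic series."""
    acc = 0.0
    while x < 6.0:
        acc += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return acc + 1.0 / x + 0.5 * inv2 + inv2 / x * (1 / 6 - inv2 * (1 / 30 - inv2 / 42))

def scalar_cramer_rao_bounds(alpha, beta):
    """Per-observation lower bounds 1/var[ln X] and 1/var[ln(1 - X)]."""
    t = trigamma(alpha + beta)
    return 1.0 / (trigamma(alpha) - t), 1.0 / (trigamma(beta) - t)

# Symmetric cases alpha = beta: the bound grows as the shape parameter grows.
for shape in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(shape, scalar_cramer_rao_bounds(shape, shape)[0])
```

For N independent observations the information adds, so these single-observation bounds would be divided by N.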
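Also one can express the joint log likelihood per N iid observations in terms of the digamma function expressions for the logarithms of the sample geometric means as follows:
$$\frac{\ln \mathcal{L}(\alpha,\beta\mid X)}{N} = (\alpha-1)\left(\psi(\hat{\alpha}) - \psi(\hat{\alpha}+\hat{\beta})\right) + (\beta-1)\left(\psi(\hat{\beta}) - \psi(\hat{\alpha}+\hat{\beta})\right) - \ln \mathrm{B}(\alpha,\beta)$$
This expression is identical to the negative of the cross-entropy (see section on "Quantities of information (entropy)"). Therefore, finding the maximum of the joint log likelihood of the shape parameters, per N iid observations, is identical to finding the minimum of the cross-entropy for the beta distribution, as a function of the shape parameters:
$$\frac{\ln \mathcal{L}(\alpha,\beta\mid X)}{N} = -H$$
with the cross-entropy defined as follows:
$$H = -\int_0^1 f(X;\hat{\alpha},\hat{\beta})\,\ln f(X;\alpha,\beta)\,dX$$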
Four unknown parameters
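The procedure is similar to the one followed in the two unknown parameter case. If Y_{1}, ..., Y_{N} are independent random variables each having a beta distribution with four parameters, the joint log likelihood function for N iid observations is:
$$\ln \mathcal{L}(\alpha,\beta,a,c\mid Y) = (\alpha-1)\sum_{i=1}^N \ln(Y_i - a) + (\beta-1)\sum_{i=1}^N \ln(c - Y_i) - N\ln \mathrm{B}(\alpha,\beta) - N(\alpha+\beta-1)\ln(c-a)$$
Finding the maximum with respect to each parameter involves taking the partial derivative with respect to that parameter and setting the expression equal to zero, yielding the maximum likelihood estimators of the four parameters:
$$\frac{\partial \ln \mathcal{L}}{\partial\alpha} = \sum_{i=1}^N \ln(Y_i - a) - N\left(\psi(\alpha) - \psi(\alpha+\beta)\right) - N\ln(c-a) = 0$$
$$\frac{\partial \ln \mathcal{L}}{\partial\beta} = \sum_{i=1}^N \ln(c - Y_i) - N\left(\psi(\beta) - \psi(\alpha+\beta)\right) - N\ln(c-a) = 0$$
$$\frac{\partial \ln \mathcal{L}}{\partial a} = -(\alpha-1)\sum_{i=1}^N \frac{1}{Y_i - a} + N\frac{\alpha+\beta-1}{c-a} = 0$$
$$\frac{\partial \ln \mathcal{L}}{\partial c} = (\beta-1)\sum_{i=1}^N \frac{1}{c - Y_i} - N\frac{\alpha+\beta-1}{c-a} = 0$$
These equations can be rearranged as the following system of four coupled equations (the first two equations are geometric means and the second two equations are the harmonic means) in terms of the maximum likelihood estimates for the four parameters $\hat{\alpha}, \hat{\beta}, \hat{a}, \hat{c}$:
$$\frac{1}{N}\sum_{i=1}^N \ln\frac{Y_i - \hat{a}}{\hat{c}-\hat{a}} = \psi(\hat{\alpha}) - \psi(\hat{\alpha}+\hat{\beta}) = \ln \hat{G}_X$$
$$\frac{1}{N}\sum_{i=1}^N \ln\frac{\hat{c} - Y_i}{\hat{c}-\hat{a}} = \psi(\hat{\beta}) - \psi(\hat{\alpha}+\hat{\beta}) = \ln \hat{G}_{(1-X)}$$
$$\frac{1}{\frac{1}{N}\sum_{i=1}^N \frac{\hat{c}-\hat{a}}{Y_i - \hat{a}}} = \frac{\hat{\alpha}-1}{\hat{\alpha}+\hat{\beta}-1} = \hat{H}_X$$
$$\frac{1}{\frac{1}{N}\sum_{i=1}^N \frac{\hat{c}-\hat{a}}{\hat{c} - Y_i}} = \frac{\hat{\beta}-1}{\hat{\alpha}+\hat{\beta}-1} = \hat{H}_{(1-X)}$$
with sample geometric means:
$$\hat{G}_X = \prod_{i=1}^N \left(\frac{Y_i - \hat{a}}{\hat{c}-\hat{a}}\right)^{1/N}, \qquad \hat{G}_{(1-X)} = \prod_{i=1}^N \left(\frac{\hat{c} - Y_i}{\hat{c}-\hat{a}}\right)^{1/N}$$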
The parameters $\hat{a}, \hat{c}$ are embedded inside the geometric mean expressions in a nonlinear way (to the power 1/N). This precludes, in general, a closed form solution, even for an initial value approximation for iteration purposes. One alternative is to use as initial values for iteration the values obtained from the method of moments solution for the four parameter case. Furthermore, the expressions for the harmonic means are well-defined only for $\hat{\alpha}, \hat{\beta} > 1$, which precludes a maximum likelihood solution for shape parameters less than unity in the four-parameter case. Fisher's information matrix for the four parameter case is positive-definite only for α, β > 2 (for further discussion, see section on Fisher information matrix, four parameter case), that is, for bell-shaped (symmetric or unsymmetric) beta distributions, with inflection points located to either side of the mode. The following Fisher information components (that represent the expectations of the curvature of the log likelihood function) have singularities at the following values:
(for further discussion see section on Fisher information matrix). Thus, it is not possible to strictly carry on the maximum likelihood estimation for some well known distributions belonging to the four-parameter beta distribution family, like the uniform distribution (Beta(1, 1, a, c)), and the arcsine distribution (Beta(1/2, 1/2, a, c)). N. L. Johnson and S. Kotz^{[1]} ignore the equations for the harmonic means and instead suggest "If a and c are unknown, and maximum likelihood estimators of a, c, α and β are required, the above procedure (for the two unknown parameter case, with X transformed as X = (Y − a)/(c − a)) can be repeated using a succession of trial values of a and c, until the pair (a, c) for which maximum likelihood (given a and c) is as great as possible, is attained" (where, for the purpose of clarity, their notation for the parameters has been translated into the present notation).
Fisher information matrix
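Let a random variable X have a probability density f(x;α). The partial derivative with respect to the (unknown, and to be estimated) parameter α of the log likelihood function is called the score. The second moment of the score is called the Fisher information:
$$\mathcal{I}(\alpha) = \operatorname{E}\left[\left(\frac{\partial}{\partial\alpha}\ln \mathcal{L}(\alpha\mid X)\right)^2\right]$$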
The expectation of the score is zero, therefore the Fisher information is also the second moment centered on the mean of the score: the variance of the score.
If the log likelihood function is twice differentiable with respect to the parameter α, and under certain regularity conditions,^{[51]} then the Fisher information may also be written as follows (which is often a more convenient form for calculation purposes):
$$\mathcal{I}(\alpha) = -\operatorname{E}\left[\frac{\partial^2}{\partial\alpha^2}\ln \mathcal{L}(\alpha\mid X)\right]$$
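The agreement between the two forms of the Fisher information (second moment of the score, and negative expected second derivative) can be checked numerically on a simple case. The sketch below uses the power function density f(x; α) = α x^{α−1} on (0, 1), that is Beta(α, 1), as an illustrative choice: its score is 1/α + ln x and its second log-derivative is the constant −1/α². The function name `fisher_information_power` and the midpoint quadrature are assumptions of this example.

```python
import math

def fisher_information_power(alpha, n=200_000):
    """Return (E[score^2], -E[second derivative]) for f(x; alpha) = alpha * x**(alpha - 1)."""
    h = 1.0 / n
    second_moment = 0.0
    for k in range(n):
        x = (k + 0.5) * h                       # midpoint of the k-th subinterval
        density = alpha * x ** (alpha - 1.0)
        score = 1.0 / alpha + math.log(x)       # d/d(alpha) of ln f(x; alpha)
        second_moment += score ** 2 * density * h
    neg_expected_curvature = 1.0 / alpha ** 2   # -d^2/d(alpha)^2 ln f is constant in x
    return second_moment, neg_expected_curvature

print(fisher_information_power(2.0))
```

The quadrature is only reliable here for α > 1, where the integrand stays bounded near x = 0; smaller α would need a change of variable.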
Thus, the Fisher information is the negative of the expectation of the second derivative with respect to the parameter α of the log likelihood function. Therefore, Fisher information is a measure of the curvature of the log likelihood function of α. A low curvature (and therefore high radius of curvature), flatter log likelihood function curve has low Fisher information; while a log likelihood function curve with large curvature (and therefore low radius of curvature) has high Fisher information. When the Fisher information matrix is computed at the estimates of the parameters ("the observed Fisher information matrix") it is equivalent to the replacement of the true log likelihood surface by a Taylor's series approximation, taken as far as the quadratic terms.^{[52]} The word information, in the context of Fisher information, refers to information about the parameters, such as estimation, sufficiency and properties of variances of estimators. The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of a parameter α:
$$\operatorname{var}[\hat{\alpha}] \geq \frac{1}{\mathcal{I}(\alpha)}$$
The precision to which one can estimate a parameter α is limited by the Fisher information of the log likelihood function. The Fisher information is a measure of the minimum error involved in estimating a parameter of a distribution and it can be viewed as a measure of the resolving power of an experiment needed to discriminate between two alternative hypotheses of a parameter.^{[53]}
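When there are N parameters
$$\theta_1, \theta_2, \dots, \theta_N,$$
then the Fisher information takes the form of an N×N positive semidefinite symmetric matrix, the Fisher information matrix, with typical element:
$$(\mathcal{I}(\theta))_{i,j} = \operatorname{E}\left[\left(\frac{\partial \ln \mathcal{L}}{\partial\theta_i}\right)\left(\frac{\partial \ln \mathcal{L}}{\partial\theta_j}\right)\right]$$
Under certain regularity conditions,^{[51]} the Fisher information matrix may also be written in the following form, which is often more convenient for computation:
$$(\mathcal{I}(\theta))_{i,j} = -\operatorname{E}\left[\frac{\partial^2 \ln \mathcal{L}}{\partial\theta_i\,\partial\theta_j}\right]$$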
With X_{1}, ..., X_{N} iid random variables, an N-dimensional "box" can be constructed with sides X_{1}, ..., X_{N}. Costa and Cover^{[54]} show that the (Shannon) differential entropy h(X) is related to the volume of the typical set (having the sample entropy close to the true entropy), while the Fisher information is related to the surface of this typical set.
Two parameters
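For X_{1}, ..., X_{N} independent random variables each having a beta distribution parametrized with shape parameters α and β, the joint log likelihood function for N iid observations is:
$$\ln \mathcal{L}(\alpha,\beta\mid X) = (\alpha-1)\sum_{i=1}^N \ln X_i + (\beta-1)\sum_{i=1}^N \ln(1-X_i) - N\ln \mathrm{B}(\alpha,\beta)$$
therefore the joint log likelihood function per N iid observations is:
$$\frac{1}{N}\ln \mathcal{L}(\alpha,\beta\mid X) = (\alpha-1)\frac{1}{N}\sum_{i=1}^N \ln X_i + (\beta-1)\frac{1}{N}\sum_{i=1}^N \ln(1-X_i) - \ln \mathrm{B}(\alpha,\beta)$$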
For the two parameter case, the Fisher information has 4 components: 2 diagonal and 2 off-diagonal. Since the Fisher information matrix is symmetric, only one of these off-diagonal components is independent. Therefore, the Fisher information matrix has 3 independent components (2 diagonal and 1 off-diagonal).
Aryal and Nadarajah^{[55]} calculated Fisher's information matrix for the four-parameter case, from which the two parameter case can be obtained as follows: