Maximum entropy probability distribution

From testwiki
Jump to navigation Jump to search

Template:Short description Template:Multiple issues

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy and differential entropy

Template:Further

If X is a continuous random variable with probability density p(x), then the differential entropy of  X  is defined as[1][2][3]

H(X)=p(x) logp(x) dx.

If  X  is a discrete random variable with distribution given by

Pr{ X=xk}=pk for k=1, 2, 

then the entropy of  X  is defined as

H(X)=k1 pk logpk.

The seemingly divergent term  p(x) logp(x)  is replaced by zero, whenever  p(x)=0.

This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing  H(X)  will also maximize the more general forms.

The base of the logarithm is not important, as long as the same one is used consistently: Change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists often prefer the natural logarithm, resulting in a unit of "nat"s for the entropy.

However, the chosen measure  dx  is crucial, even though the typical use of the Lebesgue measure is often defended as a "natural" choice: Which measure is chosen determines the entropy and the consequent maximum entropy distribution.

Distributions with measured constants

Many statistical distributions of applicable interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

Continuous case

Suppose  S  is a continuous, closed subset of the real numbers  โ„  and we choose to specify  n  measurable functions  f1, , fn  and  n  numbers  a1, , an. We consider the class  C  of all real-valued random variables which are supported on  S  (i.e. whose density function is zero outside of  S ) and which satisfy the  n  moment conditions:

๐”ผ{ fj(X)} aj for j=1, , n

If there is a member in  C  whose density function is positive everywhere in  S , and if there exists a maximal entropy distribution for  C , then its probability density  p(x)  has the following form:

p(x)=exp( j=0n λjfj(x) ) for all xS

where we assume that  f0(x)=1. The constant  λ0  and the  n  Lagrange multipliers  λ=(λ1, , λn)  solve the constrained optimization problem with  a0=1  (which ensures that  p  integrates to unity):[4]

 max λ0; λ  { j=0nλjaj exp(j=0nλjfj(x)) dx }subject toλ๐ŸŽ 

Using the Karushโ€“Kuhnโ€“Tucker conditions, it can be shown that the optimization problem has a unique solution because the objective function in the optimization is concave in  λ.

Note that when the moment constraints are equalities (instead of inequalities), that is,

 ๐”ผ{ fj(X) }=aj for j=1, , n ,

then the constraint condition  λ๐ŸŽ  can be dropped, which makes optimization over the Lagrange multipliers unconstrained.

Discrete case

Suppose  S={ x1, x2,  }  is a (finite or infinite) discrete subset of the reals, and that we choose to specify  n  functions  f1,  ,fn  and  n  numbers  a1,  ,an. We consider the class  C  of all discrete random variables  X  which are supported on  S  and which satisfy the  n  moment conditions

 ๐”ผ{ fj(X) }aj for j=1,  ,n 

If there exists a member of class  C  which assigns positive probability to all members of  S  and if there exists a maximum entropy distribution for  C , then this distribution has the following shape:

 โ„™{ X=xk }=exp( j=0nλj fj(xk) ) for k=1, 2,  

where we assume that f0=1 and the constants  λ0,λ(λ1,  ,λn)  solve the constrained optimization problem with  a0=1:[5]

 maxλ0; λ{ j=0n λj ajk1 exp( j=0n λj fj(xk) ) }for whichλ๐ŸŽ 

Again as above, if the moment conditions are equalities (instead of inequalities), then the constraint condition  λ๐ŸŽ  is not present in the optimization.

Proof in the case of equality constraints

In the case of equality constraints, this theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

fj(x)p(x)dx=aj

We consider the functional

J(p)=p(x)lnp(x)dxη0(p(x)dx1)j=1nλj(fj(x)p(x)dxaj)

where η0 and λj,j1 are the Lagrange multipliers. The zeroth constraint ensures the second axiom of probability. The other constraints are that the measurements of the function are given constants up to order n. The entropy attains an extremum when the functional derivative is equal to zero:

δJδp(p)=lnp(x)+1η0j=1nλjfj(x)=0

Therefore, the extremal entropy probability distribution in this case must be of the form (λ0:=η01),

p(x)=e1+η0ej=1nλjfj(x)=exp(j=0nλjfj(x)),

remembering that f0(x)=1. It can be verified that this is the maximal solution by checking that the variation around this solution is always negative.

Uniqueness of the maximum

Suppose  p, p  are distributions satisfying the expectation-constraints. Letting  α(0,1)  and considering the distribution  q=αp+(1α)p  it is clear that this distribution satisfies the expectation-constraints and furthermore has as support  supp(q)=supp(p)supp(p). From basic facts about entropy, it holds that  โ„‹(q)α โ„‹(p)+(1α) โ„‹(p). Taking limits  α1  and  α0 , respectively, yields  โ„‹(q)โ„‹(p),โ„‹(p).

It follows that a distribution satisfying the expectation-constraints and maximising entropy must necessarily have full support โ€” i. e. the distribution is almost everywhere strictly positive. It follows that the maximising distribution must be an internal point in the space of distributions satisfying the expectation-constraints, that is, it must be a local extreme. Thus it suffices to show that the local extreme is unique, in order to show both that the entropy-maximising distribution is unique (and this also shows that the local extreme is the global maximum).

Suppose  p, p  are local extremes. Reformulating the above computations these are characterised by parameters  λโ†’, λโ†’โ„n  via  p(x)= expλโ†’,fโ†’(x)  C(λโ†’)   and similarly for  p , where  C(λโ†’)=xโ„expλโ†’,fโ†’(x) dx. We now note a series of identities: Via 1the satisfaction of the expectation-constraints and utilising gradients / directional derivatives, one has

 D logC() |λโ†’=D C()C() |λโ†’=๐”ผp{ fโ†’(X) }=aโ†’  and similarly for  λโ†’.  Letting  u=λโ†’λโ†’โ„n  one obtains:

0=u,aโ†’aโ†’=DulogC() |λโ†’DulogC() |λโ†’=Du2logC() |γโ†’

where  γโ†’=θ λโ†’+(1θ)λโ†’  for some  θ(0,1). Computing further one has

0=Du2 logC() |γโ†’=Du (  Du C() C()) |γโ†’= Du2C() C() |γโ†’( DuC())2 C()2 |γโ†’=๐”ผq{ u,fโ†’(X)2 }(๐”ผq{ u,fโ†’(X) })2=Varq{ u,fโ†’(X) }

where  q  is similar to the distribution above, only parameterised by  γโ†’, Assuming that no non-trivial linear combination of the observables is almost everywhere (a.e.) constant, (which e.g. holds if the observables are independent and not a.e. constant), it holds that  u,fโ†’(X)  has non-zero variance, unless  u=0. By the above equation it is thus clear, that the latter must be the case. Hence  λโ†’λโ†’=u=0 , so the parameters characterising the local extrema  p, p  are identical, which means that the distributions themselves are identical. Thus, the local extreme is unique and by the above discussion, the maximum is unique โ€“ provided a local extreme actually exists.

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contain distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy.[lower-alpha 1] It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.

Examples

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution has its own entropy. To see this, rewrite the density as p(x)=exp(lnp(x)) and compare to the expression of the theorem above. By choosing lnp(x)f(x) to be the measurable function and

exp(f(x))f(x)dx=H

to be the constant, p(x) is the maximum entropy probability distribution under the constraint

p(x)f(x)dx=H.

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure lnp(x)f(x) and finding that f(x) can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972)[6] and Park & Bera (2009).[7]

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b], and thus the probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a=a0 < a1 < ... < ak = b of the interval [a,b] and probabilities p1,...,pk that add up to one, then we can consider the class of all continuous distributions such that

Pr(aj1X<aj)=pj for j=1,,k

The density of the maximum entropy distribution for this class is constant on each of the intervals [aj-1,aj). The uniform distribution on the finite set {x1,...,xn} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.

Positive and specified mean: the exponential distribution

The exponential distribution, for which the density function is

p(x|λ)={λeλxx0,0x<0,

is the maximum entropy distribution among all continuous distributions supported in [0,โˆž) that have a specified mean of 1/ฮป.

In the case of distributions supported on [0,โˆž), the maximum entropy distribution depends on relationships between the first and second moments. In specific cases, it may be the exponential distribution, or may be another distribution, or may be undefinable.[8]

Specified mean and variance: the normal distribution

The normal distribution N(ฮผ,ฯƒ2), for which the density function is

p(x|μ,σ)=1σ2πe(xμ)22σ2,

has maximum entropy among all real-valued distributions supported on (−โˆž,โˆž) with a specified variance ฯƒ2 (a particular moment). The same is true when the mean ฮผ and the variance ฯƒ2 is specified (the first two moments), since entropy is translation invariant on (−โˆž,โˆž). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these moments. (See the differential entropy article for a derivation.)

Discrete distributions with specified mean

Among all the discrete distributions supported on the set {x1,...,xn} with a specified mean ฮผ, the maximum entropy distribution has the following shape:

Pr(X=xk)=Crxk for k=1,,n

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be ฮผ.

For example, if a large number N of dice are thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x1,...,x6} = {1,...,6} and ฮผ = S/N.

Finally, among all the discrete distributions supported on the infinite set {x1,x2,...} with mean ฮผ, the maximum entropy distribution has the shape:

Pr(X=xk)=Crxk for k=1,2,,

where again the constants C and r were determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be ฮผ. For example, in the case that xk = k, this gives

C=1μ1,r=μ1μ,

such that respective maximum entropy distribution is the geometric distribution.

Circular random variables

For a continuous random variable θi distributed about the unit circle, the Von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified[9] or, equivalently, the circular mean and circular variance are specified.

When the mean and variance of the angles θi modulo 2π are specified, the wrapped normal distribution maximizes the entropy.[9]

Maximizer for specified mean, variance and skew

There exists an upper bound on the entropy of continuous random variables on โ„ with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because p(x)=cexp(λ1x+λ2x2+λ3x3) is unbounded when λ30 (see Cover & Thomas (2006: chapter 12)).

However, the maximum entropy is Template:Mvar-achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many Template:Mvar larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower order moments.

This is a special case of the general case in which the exponential of any odd-order polynomial in x will be unbounded on โ„. For example, ceλx will likewise be unbounded on โ„, but when the support is limited to a bounded or semi-bounded interval the upper entropy bound may be achieved (e.g. if x lies in the interval [0,∞] and λ< 0, the exponential distribution will result).

Maximizer for specified mean and deviation risk measure

Every distribution with log-concave density is a maximal entropy distribution with specified mean Template:Mvar and deviation risk measure Template:Math .[10]

In particular, the maximal entropy distribution with specified mean  E(X)μ  and deviation  D(X)d  is:

  • The normal distribution  ๐’ฉ(m,d2) , if  D(X)= ๐”ผ{ (Xμ)2 }   is the standard deviation;
  • The Laplace distribution, if D(X)=๐”ผ{ | Xμ| }  is the average absolute deviation;[6]
  • The distribution with density of the form f(x)=c exp( a x+bxμ2 ) if D(X)= ๐”ผ{ Xμ2 } is the standard lower semi-deviation, where a,b,c are constants and the function  y  min{ 0, y }  for any  yโ„ , returns only the negative values of its argument, otherwise zero.[10]

Other examples

In the table below, each listed distribution maximizes the entropy for a particular set of functional constraints listed in the third column, and the constraint that  x  be included in the support of the probability density, which is listed in the fourth column.[6][7]

Several listed examples (Bernoulli, geometric, exponential, Laplace, Pareto) are trivially true, because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity.

For reference,  Γ(x)=0ettx1 dt  is the gamma function, ψ(x)=d dx lnΓ(x)=Γ(x) Γ(x)   is the digamma function,  B(p,q)= Γ(p) Γ(q) Γ(p+q)  is the beta function, and  γ๐–ค  is the Euler-Mascheroni constant.

Table of probability distributions and corresponding maximum entropy constraints
Distribution name Probability density / mass function Maximum Entropy constraint Support
Uniform (discrete)  f(k)=1ba+1  None  {a,a+1,...,b1,b} 
Uniform (continuous)  f(x)=1ba  None  [a,b] 
Bernoulli  f(k)=pk(1p)1k   ๐”ผ[ K ]=p   {0,1} 
Geometric  f(k)=(1p)k1 p   ๐”ผ[ K ]=1p   โ„•{0}={1,2,3,...} 
Exponential  f(x)=λexp(λx)   ๐”ผ[ X ]=1λ   [0,) 
Laplace  f(x)=12bexp(|xμ|b)   ๐”ผ[ |Xμ| ]=b   (,) 
Asymmetric Laplace  f(x)= λ exp((xm) λ s κs) (κ+1κ) 
where ssgn(xm) 
 ๐”ผ[ (Xm) s κs ]=1λ   (,) 
Pareto  f(x)= α xmα  xα+1    ๐”ผ[ lnX ]=1α+ln(xm)   [xm,) 
Normal  f(x)=12πσ2exp((xμ)22σ2)   ๐”ผ[ X ] =μ ,
 ๐”ผ[ X2 ]=σ2+μ2 
 (,) 
Truncated normal (see article)  ๐”ผ[ X ] =μ๐–ณ ,
 ๐”ผ[ X2 ]=σ๐–ณ2+μ๐–ณ2 
 [a,b] 
von Mises  f(θ)=12πI0(κ)exp(κcos(θμ))   ๐”ผ[ cosΘ ]=I1(κ)I0(κ)cosμ ,
 ๐”ผ[ sinΘ ]=I1(κ)I0(κ)sinμ 
 [0,2π) 
Rayleigh  f(x)=xσ2exp(x22σ2)   ๐”ผ[ X2 ] =2σ2 ,
 ๐”ผ[ lnX ]=ln(2σ2)γE2 
 [0,) 
Beta  f(x)=xα1(1x)β1B(α,β) for 0x1   ๐”ผ[ lnX ]=ψ(α)ψ(α+β) ,
 ๐”ผ[ ln(1X) ]=ψ(β)ψ(α+β) 
 [0,1] 
Cauchy  f(x)=1π(1+x2)   ๐”ผ[ ln(1+X2) ]=2ln2   (,) 
Chi  f(x)=22k/2Γ(k/2)xk1exp(x22)   ๐”ผ[ X2 ]=k ,
 ๐”ผ[ lnX ]=12[ψ(k2)+ln(2)] 
 [0,) 
Chi-squared  f(x)=12k/2Γ(k/2)xk21exp(x2)   ๐”ผ[ X ]=k ,
 ๐”ผ[ lnX ]=ψ(k2)+ln(2) 
 [0,) 
Erlang  f(x)=λk(k1)!xk1exp(λx)   ๐”ผ[ X ]=k/λ ,

 ๐”ผ[ lnX ]=ψ(k)ln(λ) 

 [0,) 
Gamma  f(x)= xk1exp(x θ ) θk Γ(k)   ๐”ผ[ X ]=k θ ,
 ๐”ผ[ lnX ]=ψ(k)+lnθ 
 [0,) 
Lognormal  f(x)=1 σ x2π  exp((lnxμ)2 2σ2)   ๐”ผ[ lnX ]=μ ,
 ๐”ผ[ ln(X)2 ]=σ2+μ2 
 (0,) 
Maxwellโ€“Boltzmann  f(x)=1a3  2 π  x2exp( x2 2a2)   ๐”ผ[ X2 ]=3a2 ,
 ๐”ผ[ lnX ]=1+ln(a2)γE2 
 [0,) 
Weibull  f(x)=kλk  xk1 exp( xk λk)   ๐”ผ[ Xk ]=λk ,
 E[ lnX ]=ln(λ)γEk 
 [0,) 
Multivariate normal  fX(xโ†’)= 
  exp( 1 2 [xโ†’μโ†’]Σ1[xโ†’μโ†’])  (2π)N|Σ|  
 ๐”ผ[ Xโ†’ ]=μโ†’ ,
 ๐”ผ[ (Xโ†’μโ†’)(Xโ†’μโ†’) ]=Σ 
 โ„n 
Binomial  f(k)=(nk)pk(1p)nk   ๐”ผ[ X ]=μ ,
 f  n-generalized binomial distribution[11]
{0,,n}
Poisson  f(k)=λkexp(λ)k!   ๐”ผ[ X ]=λ ,
 f -generalized binomial distribution}[11]
 โ„•={0,1,} 
Logistic  f(x)=ex(1+ex)2  =e+x(e+x+1)2    ๐”ผ[ X ]=0 ,
 ๐”ผ[ ln(1+eX) ]=1 
 {,} 

The maximum entropy principle can be used to upper bound the entropy of statistical mixtures.[12]

See also

Notes

Template:Notelist

Citations

Template:Reflist

References

Template:ProbDistributions


Cite error: <ref> tags exist for a group named "lower-alpha", but no corresponding <references group="lower-alpha"/> tag was found