Jensen–Shannon divergence

In probability theory and statistics, the Jensen–Shannon divergence, named after Johan Jensen and Claude Shannon, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1][2] or total divergence to the average.[3] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance. The closer the Jensen–Shannon distance is to zero, the more similar the two distributions are.[4][5][6]

Definition

Consider the set $M_+^1(A)$ of probability distributions, where $A$ is a set provided with some σ-algebra of measurable subsets. In particular, we can take $A$ to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence $D(P \parallel Q)$. It is defined by

$$\mathrm{JSD}(P \parallel Q) = \tfrac{1}{2} D(P \parallel M) + \tfrac{1}{2} D(Q \parallel M),$$

where $M = \tfrac{1}{2}(P + Q)$ is a mixture distribution of $P$ and $Q$.
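
As a concrete illustration, here is a minimal Python sketch of this definition for discrete distributions, using base-2 logarithms; the helper names kl_divergence and jsd are illustrative, not taken from any cited source.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute 0 by convention
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def jsd(p, q):
    """Jensen-Shannon divergence: 0.5*D(p||m) + 0.5*D(q||m) with m = (p + q)/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)                 # the mixture is positive wherever p or q is
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Example: two distributions on a 3-element alphabet.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]
print(jsd(p, q))            # symmetric: jsd(p, q) == jsd(q, p)
print(np.sqrt(jsd(p, q)))   # the Jensen-Shannon distance (a metric)
```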

The geometric Jensen–Shannon divergence[7] (or G-Jensen–Shannon divergence) replaces the arithmetic mean in the mixture $M$ with a geometric mean, which yields a closed-form formula for the divergence between two Gaussian distributions.

A more general definition, allowing for the comparison of more than two probability distributions, is:

$$\mathrm{JSD}_{\pi_1,\ldots,\pi_n}(P_1, P_2, \ldots, P_n) = \sum_{i} \pi_i D(P_i \parallel M) = H(M) - \sum_{i=1}^{n} \pi_i H(P_i),$$

where

$$M := \sum_{i=1}^{n} \pi_i P_i$$

and $\pi_1, \ldots, \pi_n$ are weights selected for the probability distributions $P_1, P_2, \ldots, P_n$, and $H(P)$ is the Shannon entropy of distribution $P$. For the two-distribution case described above,

$$P_1 = P, \quad P_2 = Q, \quad \pi_1 = \pi_2 = \tfrac{1}{2}.$$

Hence, for those distributions $P$ and $Q$,

$$\mathrm{JSD}(P \parallel Q) = H(M) - \tfrac{1}{2}\bigl(H(P) + H(Q)\bigr).$$
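
A corresponding sketch of the weighted n-distribution form, using the entropy identity above (again, the helper names and base parameter are illustrative assumptions):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p), by default in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                              # 0 * log 0 = 0 by convention
    return -np.sum(p * np.log(p)) / np.log(base)

def generalized_jsd(dists, weights, base=2):
    """JSD_pi(P_1, ..., P_n) = H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists                 # M = sum_i pi_i P_i
    return entropy(mixture, base) - sum(w * entropy(p, base) for w, p in zip(weights, dists))

# With two distributions and equal weights this reduces to the two-distribution JSD above.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]
print(generalized_jsd([p, q], [0.5, 0.5]))
```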

Bounds

The Jensen–Shannon divergence is bounded by 1 for two probability distributions, given that one uses the base 2 logarithm:[8]

$$0 \leq \mathrm{JSD}(P \parallel Q) \leq 1.$$

With this normalization, it is a lower bound on the total variation distance between P and Q:

$$\mathrm{JSD}(P \parallel Q) \leq \tfrac{1}{2} \lVert P - Q \rVert_1 = \tfrac{1}{2} \sum_{\omega \in \Omega} \bigl|P(\omega) - Q(\omega)\bigr|.$$
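
Assuming SciPy is available, both bounds can be spot-checked numerically; scipy.spatial.distance.jensenshannon returns the Jensen–Shannon distance, so it is squared to recover the divergence:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(5))             # random distribution on 5 outcomes
    q = rng.dirichlet(np.ones(5))
    d = jensenshannon(p, q, base=2) ** 2      # SciPy returns the distance; square it
    tv = 0.5 * np.sum(np.abs(p - q))          # total variation distance
    assert 0.0 <= d <= 1.0                    # normalization bound with base-2 logs
    assert d <= tv + 1e-12                    # JSD lower-bounds the total variation
print("both bounds held on 1000 random pairs")
```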

With the base-$e$ logarithm, which is commonly used in statistical thermodynamics, the upper bound is $\ln(2)$. In general, the bound in base $b$ is $\log_b(2)$:

$$0 \leq \mathrm{JSD}(P \parallel Q) \leq \log_b(2).$$

More generally, the Jensen–Shannon divergence of $n$ probability distributions is bounded by $\log_b(n)$:[8]

$$0 \leq \mathrm{JSD}_{\pi_1,\ldots,\pi_n}(P_1, P_2, \ldots, P_n) \leq \log_b(n).$$
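
Reusing the generalized_jsd sketch from the Definition section (an assumption of this illustration, not a standard library routine), the n-distribution bound can be checked in any base:

```python
import numpy as np

# Requires the entropy/generalized_jsd helpers sketched in the Definition section.
rng = np.random.default_rng(1)
n = 4                                         # four distributions on six outcomes
dists = rng.dirichlet(np.ones(6), size=n)
weights = np.full(n, 1.0 / n)

for base in (2, np.e, 10):
    d = generalized_jsd(dists, weights, base=base)
    assert 0.0 <= d <= np.log(n) / np.log(base)   # 0 <= JSD <= log_b(n)
```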

Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable $X$ drawn from the mixture distribution of $P$ and $Q$ and the binary indicator variable $Z$ that is used to switch between $P$ and $Q$ to produce the mixture. Let $X$ be some abstract function on the underlying set of events that discriminates well between events, and choose the value of $X$ according to $P$ if $Z = 0$ and according to $Q$ if $Z = 1$, where $Z$ is equiprobable. That is, $X$ is drawn according to the probability measure $M = (P + Q)/2$, and its distribution is the mixture distribution. We compute

$$\begin{aligned}
I(X; Z) &= H(X) - H(X \mid Z) \\
&= -\sum M \log M + \tfrac{1}{2}\Bigl[\sum P \log P + \sum Q \log Q\Bigr] \\
&= -\sum \tfrac{P}{2} \log M - \sum \tfrac{Q}{2} \log M + \tfrac{1}{2}\Bigl[\sum P \log P + \sum Q \log Q\Bigr] \\
&= \tfrac{1}{2} \sum P \bigl(\log P - \log M\bigr) + \tfrac{1}{2} \sum Q \bigl(\log Q - \log M\bigr) \\
&= \mathrm{JSD}(P \parallel Q)
\end{aligned}$$

It follows from the above result that the Jensen–Shannon divergence is bounded between 0 and 1, because mutual information is non-negative and bounded by $H(Z) = 1$ when the base-2 logarithm is used.
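
As a hedged numerical check of this identity, the following sketch builds the joint distribution of (X, Z) with an equiprobable Z and compares the resulting mutual information with the Jensen–Shannon divergence (SciPy's jensenshannon returns the distance, hence the squaring):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.4, 0.5])

# Joint distribution of (X, Z): Z is a fair coin, X ~ P when Z = 0 and X ~ Q when Z = 1.
joint = np.stack([0.5 * p, 0.5 * q])          # shape (2, 3), rows indexed by Z
px = joint.sum(axis=0)                        # marginal of X, i.e. the mixture M
pz = joint.sum(axis=1)                        # marginal of Z = (1/2, 1/2)

# Mutual information I(X; Z) in bits.
mi = np.sum(joint * np.log2(joint / np.outer(pz, px)))

print(mi, jensenshannon(p, q, base=2) ** 2)   # the two values agree
```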

One can apply the same principle to a joint distribution and the product of its two marginal distributions (in analogy to the Kullback–Leibler divergence and mutual information) to measure how reliably one can decide whether a given response comes from the joint distribution or the product distribution, subject to the assumption that these are the only two possibilities.[9]

Quantum Jensen–Shannon divergence

The generalization of probability distributions to density matrices allows one to define the quantum Jensen–Shannon divergence (QJSD).[10][11] It is defined for a set of density matrices $(\rho_1, \ldots, \rho_n)$ and a probability distribution $\pi = (\pi_1, \ldots, \pi_n)$ as

$$\mathrm{QJSD}(\rho_1, \ldots, \rho_n) = S\!\left(\sum_{i=1}^{n} \pi_i \rho_i\right) - \sum_{i=1}^{n} \pi_i S(\rho_i)$$

where $S(\rho)$ is the von Neumann entropy of $\rho$. This quantity was introduced in quantum information theory, where it is called the Holevo information: it gives an upper bound on the amount of classical information encoded by the quantum states $(\rho_1, \ldots, \rho_n)$ under the prior distribution $\pi$ (see Holevo's theorem).[12] The quantum Jensen–Shannon divergence for $\pi = (\tfrac{1}{2}, \tfrac{1}{2})$ and two density matrices is a symmetric function, everywhere defined, bounded, and equal to zero only if the two density matrices are the same. It is the square of a metric for pure states,[13] and it was recently shown that this metric property holds for mixed states as well.[14][15] The Bures metric is closely related to the quantum JS divergence; it is the quantum analog of the Fisher information metric.
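
A minimal sketch of this formula for density matrices, computing the von Neumann entropy from eigenvalues with NumPy (function names are illustrative assumptions, and base-2 logarithms are used):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -Tr(rho log2 rho), computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]              # 0 * log 0 = 0 by convention
    return -np.sum(evals * np.log2(evals))

def qjsd(rhos, weights):
    """QJSD = S(sum_i pi_i rho_i) - sum_i pi_i S(rho_i)."""
    mixture = sum(w * r for w, r in zip(weights, rhos))
    return von_neumann_entropy(mixture) - sum(w * von_neumann_entropy(r) for w, r in zip(weights, rhos))

# Two qubit density matrices: the pure state |0><0| and the maximally mixed state.
rho1 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho2 = 0.5 * np.eye(2)
print(qjsd([rho1, rho2], [0.5, 0.5]))         # lies in [0, 1] with base-2 logs
```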

Jensen–Shannon centroid

The centroid $C^*$ of a finite set of probability distributions can be defined as the minimizer of the average of the Jensen–Shannon divergences between a candidate distribution and the prescribed set of distributions:

$$C^* = \arg\min_{Q} \sum_{i=1}^{n} \mathrm{JSD}(P_i \parallel Q).$$

An efficient algorithm[16] (CCCP), based on difference of convex functions, has been reported for computing the Jensen–Shannon centroid of a set of discrete distributions (histograms).
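
The following sketch is not the CCCP algorithm from the cited reference; it simply approximates the centroid numerically by minimizing the total Jensen–Shannon divergence over the probability simplex via a softmax parameterization (the helper name js_centroid is an assumption):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import jensenshannon

def js_centroid(hists):
    """Numerically approximate argmin_Q sum_i JSD(P_i || Q) over the simplex."""
    hists = np.asarray(hists, dtype=float)
    k = hists.shape[1]

    def objective(theta):
        q = np.exp(theta - theta.max())
        q /= q.sum()                          # softmax keeps Q on the probability simplex
        return sum(jensenshannon(p, q, base=2) ** 2 for p in hists)

    res = minimize(objective, x0=np.zeros(k), method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-10, "fatol": 1e-12})
    q = np.exp(res.x - res.x.max())
    return q / q.sum()

hists = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
print(js_centroid(hists))                     # a distribution "between" the three histograms
```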

Applications

The Jensen–Shannon divergence has been applied in bioinformatics and genome comparison, in protein surface comparison, in the social sciences, in the quantitative study of history, in fire experiments,[17] and in machine learning.[18]
