Integral probability metric
In probability theory, integral probability metrics are types of distance functions between probability distributions, defined by how well a class of functions can distinguish the two distributions. Many important statistical distances are integral probability metrics, including the Wasserstein-1 distance and the total variation distance. In addition to theoretical importance, integral probability metrics are widely used in areas of statistics and machine learning.
The name "integral probability metric" was given by German statistician Alfred Müller;[1] the distances had also previously been called "metrics with a Template:Math-structure."[2]
Definition
Integral probability metrics (IPMs) are distances on the space of distributions over a set $\mathcal{X}$, defined by a class $\mathcal{F}$ of real-valued functions on $\mathcal{X}$ as

$$D_{\mathcal{F}}(P, Q) = \sup_{f \in \mathcal{F}} \left| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \right| ;$$

here the notation $\mathbb{E}_P f$ refers to the expectation of $f$ under the distribution $P$. The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f \in \mathcal{F}$ its negation $-f$ is also in $\mathcal{F}$.
The functions $f$ being optimized over are sometimes called "critic" functions;[3] if a particular $f^*$ achieves the supremum, it is often termed a "witness function"[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from $P$ and small (likely negative) values for samples from $Q$; this can be thought of as a weaker version of classifiers, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.
The choice of $\mathcal{F}$ determines the particular distance; more than one $\mathcal{F}$ can generate the same distance.[1]
For any choice of $\mathcal{F}$, $D_{\mathcal{F}}$ satisfies all the definitions of a metric except that we may have $D_{\mathcal{F}}(P, Q) = 0$ for some $P \neq Q$; this is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class $\mathcal{F} = \{0\}$ which only contains the zero function, $D_{\mathcal{F}}$ is identically zero. $D_{\mathcal{F}}$ is a metric if and only if $\mathcal{F}$ separates points on the space of probability distributions, i.e. for any $P \neq Q$ there is some $f \in \mathcal{F}$ such that $\mathbb{E}_P f \neq \mathbb{E}_Q f$;[1] most, but not all, common particular cases satisfy this property.
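As a concrete illustration of the definition, the following sketch (not from the cited sources; the function names and the choice of class are assumptions for illustration) computes $D_{\mathcal{F}}$ between two distributions on a small finite set, using $\mathcal{F} = \{f : \|f\|_\infty \le 1\}$. For this class the supremum has a closed form, attained at $f = \operatorname{sign}(p - q)$, so $D_{\mathcal{F}}(P, Q) = \sum_i |p_i - q_i|$; the brute-force search over sign-valued critics is included only to mirror the supremum in the definition and is feasible only for tiny supports.

```python
import itertools
import numpy as np

def ipm_bounded_class(p, q):
    """D_F(P, Q) for F = {f : |f(x)| <= 1} on a finite support.

    The supremum over this class is attained at f = sign(p - q),
    so the distance equals sum_i |p_i - q_i|.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.abs(p - q).sum()

def ipm_bounded_class_bruteforce(p, q):
    """The same quantity, taking the sup over all sign-valued critics
    f : support -> {-1, +1}; included only to make the sup explicit."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    best = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=len(p)):
        f = np.array(signs)
        best = max(best, abs(f @ p - f @ q))  # |E_P f - E_Q f|
    return best

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(ipm_bounded_class(p, q))             # 0.2
print(ipm_bounded_class_bruteforce(p, q))  # 0.2
```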
Examples
All of these examples are metrics except when noted otherwise.
- The Wasserstein-1 distance (also called earth mover's distance), via its dual representation, takes $\mathcal{F}$ to be the set of 1-Lipschitz functions.
- The related Dudley metric is generated by the set of bounded 1-Lipschitz functions.
- The total variation distance can be generated by taking $\mathcal{F}$ to be the set of indicator functions of arbitrary events, or, up to the choice of normalization, by the larger class $\{f : \|f\|_\infty \le 1\}$.
- The closely related Radon metric is generated by continuous functions bounded in $[-1, 1]$.
- The Kolmogorov metric used in the Kolmogorov-Smirnov test has a function class of indicator functions, $\mathcal{F} = \{\mathbb{1}_{(-\infty, t]} : t \in \mathbb{R}\}$.
- The kernel maximum mean discrepancy (MMD) takes $\mathcal{F}$ to be the unit ball in a reproducing kernel Hilbert space. This distance is particularly easy to estimate from samples, requiring no optimization; it is a proper metric exactly when the underlying kernel is characteristic.[5] (A sample-based sketch appears after this list.)
- The energy distance, as a special case of the maximum mean discrepancy,[6] is generated by the unit ball in a particular reproducing kernel Hilbert space.
- Defining $\mathcal{F}$ by functions with a bounded Sobolev norm gives a useful distance for generative modeling, among other applications.[7]
- Functions with bounded Besov norm generalize many other forms of IPM and are amenable to theoretical analysis.[8][9]
- Many variants of generative adversarial networks and classifier-based two-sample tests[10][11] use a "neural net distance"[12][13] where $\mathcal{F}$ is a class of neural networks; these are not metrics for typical fixed-size networks, but could be for other classifiers. For Wasserstein GANs in particular, it has been argued that analysis in terms of this distance, rather than the Wasserstein distance it approximates, is important for understanding the behavior of these models.[12][14][15]
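As a minimal sketch of the MMD item above (estimation without optimization), the following computes the plug-in (V-statistic) estimate of the kernel MMD between two samples. The Gaussian kernel, the bandwidth value, and the function names are assumptions chosen for illustration; the Gaussian kernel is characteristic, so the corresponding MMD is a proper metric.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_plugin(x, y, bandwidth=1.0):
    """Plug-in (V-statistic) estimate of the kernel MMD between the
    empirical measures of samples x (shape (m, d)) and y (shape (n, d))."""
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    mmd_sq = k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()
    return np.sqrt(max(mmd_sq, 0.0))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(500, 2))  # samples from P
y = rng.normal(0.5, 1.0, size=(500, 2))  # samples from a shifted Q
print(mmd_plugin(x, y))  # noticeably larger than for two samples from the same P
```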
Relationship to f-divergences
The [[F-divergence|f-divergences]] are probably the best-known way to measure dissimilarity of probability distributions. It has been shown that the only functions which are both IPMs and f-divergences are of the form $c \cdot D_{TV}(P, Q)$, where $c \in [0, \infty]$ and $D_{TV}$ is the total variation distance between distributions.
One major difference between f-divergences and most IPMs is that when $P$ and $Q$ have disjoint support, all f-divergences take on a constant value;[16] by contrast, IPMs where functions in $\mathcal{F}$ are "smooth" can give "partial credit." For instance, consider the sequence of Dirac measures $\delta_{1/n}$ at $1/n$; this sequence converges in distribution to $\delta_0$, and many IPMs satisfy $D_{\mathcal{F}}(\delta_{1/n}, \delta_0) \to 0$, but no nonzero f-divergence can satisfy this. That is, many IPMs are continuous in weaker topologies than f-divergences. This property is sometimes of substantial importance,[17] although other options also exist, such as considering f-divergences between distributions convolved with continuous noise.[17][18]
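To make the contrast concrete, a short worked computation (standard facts, not taken from the cited sources): for the Wasserstein-1 distance the optimal coupling simply moves the point mass a distance $1/n$, while the total variation distance, an f-divergence, only sees that the supports are disjoint:

$$W_1(\delta_{1/n}, \delta_0) = \tfrac{1}{n} \to 0, \qquad D_{TV}(\delta_{1/n}, \delta_0) \text{ takes its maximal value for every } n.$$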
Estimation from samples
Because IPM values between discrete distributions are often sensible, it is frequently reasonable to estimate $D_{\mathcal{F}}(P, Q)$ using a simple "plug-in" estimator $D_{\mathcal{F}}(\hat{P}, \hat{Q})$, where $\hat{P}$ and $\hat{Q}$ are empirical measures of sample sets. These empirical distances can be computed exactly for some classes $\mathcal{F}$;[19] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.[13][20][21]
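One case where the plug-in distance has an exact closed form is the Wasserstein-1 distance between one-dimensional empirical measures of equal size: matching sorted samples is an optimal coupling, so the distance is the average gap between order statistics. The sketch below assumes equal sample sizes; the function name is chosen here for illustration.

```python
import numpy as np

def wasserstein1_plugin_1d(x, y):
    """Exact plug-in Wasserstein-1 distance between the empirical measures
    of two 1-D samples of equal size: sort both samples and average the
    absolute differences of the matched order statistics."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if x.shape != y.shape:
        raise ValueError("this simple form assumes equal sample sizes")
    return np.mean(np.abs(x - y))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=1000)  # samples from P
y = rng.normal(0.3, 1.0, size=1000)  # samples from a shifted Q
print(wasserstein1_plugin_1d(x, y))  # roughly 0.3, the size of the shift
```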
When exact maximization is not available or is too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures $\hat{P}_{\text{train}}$ and $\hat{Q}_{\text{train}}$) and "test" sets ($\hat{P}_{\text{test}}$ and $\hat{Q}_{\text{test}}$), find a critic $\hat{f}$ approximately maximizing $|\mathbb{E}_{\hat{P}_{\text{train}}} \hat{f} - \mathbb{E}_{\hat{Q}_{\text{train}}} \hat{f}|$, then use $|\mathbb{E}_{\hat{P}_{\text{test}}} \hat{f} - \mathbb{E}_{\hat{Q}_{\text{test}}} \hat{f}|$ as an estimate.[22][11][23][24] This estimator can possibly be consistent, but has a negative bias. In fact, no unbiased estimator can exist for any IPM, although there is for instance an unbiased estimator of the squared maximum mean discrepancy.
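The following is a minimal sketch of this train/test scheme for a toy class of unit-norm linear critics, $\mathcal{F} = \{f(v) = w \cdot v : \|w\| \le 1\}$; this class, the function name, and the data are illustrative assumptions, not a class used by the cited works. For this class the training-set maximizer has a closed form (the normalized difference of training means), and evaluating it on the held-out halves avoids reusing the same samples, at the cost of the negative bias discussed above.

```python
import numpy as np

def split_estimate_linear_ipm(x, y, rng):
    """Train/test estimate of D_F for the toy class
    F = {f(v) = w . v : ||w|| <= 1} (an illustrative, hypothetical choice).

    On the training halves the maximizing critic has a closed form:
    w* is the normalized difference of the training sample means.
    The same critic is then evaluated on the held-out halves."""
    x, y = rng.permutation(x), rng.permutation(y)
    x_tr, x_te = np.array_split(x, 2)
    y_tr, y_te = np.array_split(y, 2)

    # "Training": pick the critic maximizing |E_Ptrain f - E_Qtrain f|.
    direction = x_tr.mean(axis=0) - y_tr.mean(axis=0)
    w = direction / (np.linalg.norm(direction) + 1e-12)

    # "Testing": evaluate that fixed critic on the held-out samples.
    return abs(x_te.mean(axis=0) @ w - y_te.mean(axis=0) @ w)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=(400, 3))  # samples from P
y = rng.normal(0.2, 1.0, size=(400, 3))  # samples from a shifted Q
print(split_estimate_linear_ipm(x, y, rng))  # near ||mean(P) - mean(Q)||
```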
References
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite arXiv
- ↑ Template:Cite arXiv
- ↑ Template:Cite web
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite arXiv
- ↑ Template:Cite journal
- ↑ Template:Cite arXiv
- ↑ Template:Cite journal
- ↑ Template:Cite journal
- ↑ Template:Cite journal