Ball divergence

Ball Divergence (BD) is a nonparametric measure of the difference between two probability distributions.[1] It was introduced to address shortcomings of traditional methods for comparing distributions, particularly for high-dimensional, non-normal, or imbalanced data. Unlike classical tests such as Student's t-test or Hotelling's T² test, which typically require assumptions about the data (e.g., normality), Ball Divergence does not rely on any specific distributional assumptions. This makes it especially useful when those assumptions fail, for example in the presence of outliers or heavy-tailed distributions.

Background

In statistics, testing whether two multivariate samples are drawn from the same unknown distribution is an important and challenging task. This comparison is essential in fields such as hypothesis testing, machine learning, bioinformatics, and environmental studies. Traditionally, it has been handled with parametric methods such as Student's t-test or Hotelling's T² test, which typically assume that the data satisfy conditions such as normality, homogeneity of variances, or independence of the samples. In practice, however, these assumptions often do not hold, particularly when the data are high-dimensional, contain outliers, or have heavy tails. In such situations, traditional tests may fail to detect meaningful differences between the distributions, leading to incorrect conclusions.

A widely used nonparametric two-sample test is the energy distance test.[2] However, its effectiveness relies on moment conditions, which makes it less effective for extremely imbalanced data (where one sample size is disproportionately larger than the other). To address this issue, Chen, Dou, and Qiao proposed a nonparametric multivariate test using ensemble subsampling nearest neighbors (ESS-NN) for imbalanced data.[3] This method handles imbalanced data effectively and increases the test's power by fixing the size of the smaller group while increasing the size of the larger group.

Additionally, Gretton et al. introduced the maximum mean discrepancy (MMD) for the two-sample problem.[4] Both methods require additional tuning parameters, such as the number of groups k in ESS-NN and the kernel function in MMD. Ball Divergence addresses the two-sample test problem for extremely imbalanced samples without introducing such parameters.

Definition

The formal definition of Ball Divergence involves integrating the squared difference between two probability measures over a family of closed balls in a Banach space. This is achieved by first defining a metric (or distance function) within the space, which allows us to measure the distance between points. A closed ball around a point u is simply the set of all points that are within a fixed distance r from u, where r is the radius of the ball.
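In symbols, writing <math>\rho</math> for the chosen metric on the space <math>V</math>, the closed ball of radius <math>r</math> around <math>u</math> is
<math display="block">\bar{B}(u,r)=\{\,z\in V:\ \rho(z,u)\le r\,\}.</math>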

The Ball Divergence formula is given as follows:
<math display="block">\operatorname{BD}(\mu,\nu)=\iint_{V\times V}\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\bigl(\mu(du)\,\mu(dv)+\nu(du)\,\nu(dv)\bigr),</math>
where:

  • μ and ν are the probability measures being compared.
  • <math>\bar{B}(u,\rho(u,v))</math> represents a closed ball in the space, centered at u, with radius equal to the distance between the points u and v as measured by the metric ρ.
  • The integral is taken over all possible pairs of points, summing the squared differences of the two measures over all such balls.
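Here <math>[\mu-\nu]^{2}(\cdot)</math> denotes the squared difference of the two measures evaluated on the same ball, i.e.
<math display="block">\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)=\Bigl(\mu\bigl(\bar{B}(u,\rho(u,v))\bigr)-\nu\bigl(\bar{B}(u,\rho(u,v))\bigr)\Bigr)^{2}.</math>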

This measure allows for a detailed, scale-sensitive comparison between the two distributions. The integral captures global differences between the distributions, but because it is defined over balls, the comparison is inherently local as well, making it robust to variations in the data and more sensitive to local differences than traditional nonparametric methods.

Testing for equal distributions

Next we give a sample version of Ball Divergence. For convenience, decompose the Ball Divergence into two parts:
<math display="block">A=\iint_{V\times V}\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\mu(du)\,\mu(dv)</math>
and
<math display="block">C=\iint_{V\times V}\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\nu(du)\,\nu(dv),</math>
so that <math>\operatorname{BD}(\mu,\nu)=A+C</math>.

Let <math>\delta(x,y,z)=I\bigl(z\in\bar{B}(x,\rho(x,y))\bigr)</math> indicate whether the point z lies in the ball <math>\bar{B}(x,\rho(x,y))</math>. Given two independent samples <math>\{X_1,\ldots,X_n\}</math> from <math>\mu</math> and <math>\{Y_1,\ldots,Y_m\}</math> from <math>\nu</math>, define

<math display="block">A^{X}_{ij}=\frac{1}{n}\sum_{u=1}^{n}\delta(X_i,X_j,X_u),\qquad A^{Y}_{ij}=\frac{1}{m}\sum_{v=1}^{m}\delta(X_i,X_j,Y_v),</math>
<math display="block">C^{X}_{kl}=\frac{1}{n}\sum_{u=1}^{n}\delta(Y_k,Y_l,X_u),\qquad C^{Y}_{kl}=\frac{1}{m}\sum_{v=1}^{m}\delta(Y_k,Y_l,Y_v),</math>
where <math>A^{X}_{ij}</math> is the proportion of the sample from the probability measure <math>\mu</math> located in the ball <math>\bar{B}(X_i,\rho(X_i,X_j))</math> and <math>A^{Y}_{ij}</math> is the proportion of the sample from the probability measure <math>\nu</math> located in that same ball. Likewise, <math>C^{X}_{kl}</math> and <math>C^{Y}_{kl}</math> are the proportions of the samples from <math>\mu</math> and <math>\nu</math>, respectively, located in the ball <math>\bar{B}(Y_k,\rho(Y_k,Y_l))</math>. The sample versions of A and C are as follows

<math display="block">A_{n,m}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\left(A^{X}_{ij}-A^{Y}_{ij}\right)^{2},\qquad C_{n,m}=\frac{1}{m^{2}}\sum_{k,l=1}^{m}\left(C^{X}_{kl}-C^{Y}_{kl}\right)^{2}.</math>
Finally, we can give the sample ball divergence

<math display="block">\operatorname{BD}_{n,m}=A_{n,m}+C_{n,m}.</math>
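As an illustration, the following is a minimal sketch of how <math>\operatorname{BD}_{n,m}</math> could be computed directly from the definitions above, assuming the Euclidean metric and using SciPy for pairwise distances; the function name sample_ball_divergence is a placeholder introduced here, not an official implementation.

<syntaxhighlight lang="python">
from scipy.spatial.distance import cdist

def sample_ball_divergence(X, Y):
    """Sample Ball Divergence BD_{n,m} for X ~ mu (n x d) and Y ~ nu (m x d)."""
    # Pairwise distances used to decide ball membership (Euclidean metric).
    dXX = cdist(X, X)   # rho(X_i, X_u)
    dXY = cdist(X, Y)   # rho(X_i, Y_v)
    dYY = cdist(Y, Y)   # rho(Y_k, Y_v)
    dYX = cdist(Y, X)   # rho(Y_k, X_u)

    # A^X_{ij}, A^Y_{ij}: proportions of each sample inside the closed ball
    # centred at X_i with radius rho(X_i, X_j).
    radii_X = dXX[:, :, None]                        # shape (n, n, 1)
    A_X = (dXX[:, None, :] <= radii_X).mean(axis=2)  # mean of delta(X_i, X_j, X_u)
    A_Y = (dXY[:, None, :] <= radii_X).mean(axis=2)  # mean of delta(X_i, X_j, Y_v)

    # C^X_{kl}, C^Y_{kl}: the same proportions for balls centred at Y_k
    # with radius rho(Y_k, Y_l).
    radii_Y = dYY[:, :, None]
    C_X = (dYX[:, None, :] <= radii_Y).mean(axis=2)
    C_Y = (dYY[:, None, :] <= radii_Y).mean(axis=2)

    A_nm = ((A_X - A_Y) ** 2).mean()  # (1/n^2) sum over i, j
    C_nm = ((C_X - C_Y) ** 2).mean()  # (1/m^2) sum over k, l
    return A_nm + C_nm
</syntaxhighlight>

The broadcasting builds an indicator array over all balls and all sample points, so memory grows cubically in the sample sizes; this is acceptable for an illustration but not for large samples.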

Properties

1. <math>\operatorname{BD}(\mu,\nu)\ge 0</math>, where the equality holds if and only if <math>\mu=\nu</math>.

2. Ball Divergence is symmetric in its arguments, but its square root does not satisfy the triangle inequality, so it is a divergence rather than a metric.

3. BD can be generalized to the K-sample problem. Suppose that <math>\mu_1,\ldots,\mu_K</math> are K probability measures on a Banach space. Define

<math display="block">D(\mu_1,\ldots,\mu_K)=\sum_{1\le l<k\le K}\iint_{V\times V}\left[\mu_k-\mu_l\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\bigl(\mu_k(du)\,\mu_k(dv)+\mu_l(du)\,\mu_l(dv)\bigr).</math>

Clearly, <math>D(\mu_1,\ldots,\mu_K)=0</math> if and only if <math>\mu_1=\cdots=\mu_K</math>.
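A corresponding sample statistic can be sketched by summing the two-sample statistic over all pairs of groups, mirroring the definition above; this reuses the hypothetical sample_ball_divergence function from the earlier sketch and is an illustrative assumption rather than a procedure taken from the cited references.

<syntaxhighlight lang="python">
from itertools import combinations

def k_sample_ball_divergence(samples):
    """Sum of pairwise sample Ball Divergences over all pairs of groups.

    samples : list of (n_i, d) arrays, one array per group.
    Mirrors D(mu_1, ..., mu_K) as a sum of two-sample divergences.
    """
    return sum(sample_ball_divergence(S, T)
               for S, T in combinations(samples, 2))
</syntaxhighlight>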


4. Consistency: We have

<math display="block">\operatorname{BD}_{n,m}\ \xrightarrow[\;n,m\to\infty\;]{\text{a.s.}}\ \operatorname{BD}(\mu,\nu),</math>
where <math>\tfrac{n}{n+m}\to\tau</math> for some <math>\tau\in[0,1]</math>.

Define <math>\xi(x,y,z_1,z_2)=\delta(x,y,z_1)\,\delta(x,y,z_2)</math>, and then let
<math display="block">Q(x,y;x',y')=\phi^{(2,0)}_{A}(x,x')+\phi^{(1,1)}_{A}(x,y')+\phi^{(1,1)}_{A}(x',y)+\phi^{(0,2)}_{A}(y,y'),</math>
where

<math display="block">\begin{aligned}
\phi^{(2,0)}_{A}(x,x') &= E[\xi(X_1,X_2,x,x')]+E[\xi(X_1,X_2,Y,Y_3)]-E[\xi(X_1,X_2,x,Y)]-E[\xi(X_1,X_2,x',Y_3)],\\
\phi^{(1,1)}_{A}(x,y) &= E[\xi(X_1,X_2,x,X_3)]+E[\xi(X_1,X_2,y,Y_3)]-E[\xi(X_1,X_2,x,y)]-E[\xi(X_1,X_2,X_3,Y_3)],\\
\phi^{(0,2)}_{A}(y,y') &= E[\xi(X_1,X_2,X,X_3)]+E[\xi(X_1,X_2,y,y')]-E[\xi(X_1,X_2,X,y)]-E[\xi(X_1,X_2,X_3,y')].
\end{aligned}</math>
The function <math>Q(x,y;x',y')</math> has the spectral decomposition
<math display="block">Q(x,y;x',y')=\sum_{k=1}^{\infty}\lambda_k f_k(x,y)f_k(x',y'),</math>
where <math>\lambda_k</math> and <math>f_k</math> are the eigenvalues and eigenfunctions of <math>Q</math>. For <math>k=1,2,\ldots</math>, let <math>Z_{1k},Z_{2k}</math> be i.i.d. <math>N(0,1)</math>, and set
<math display="block">a_k^{2}(\tau)=(1-\tau)\,E_X\bigl[E_Y f_k(X,Y)\bigr]^{2},\qquad b_k^{2}(\tau)=\tau\,E_Y\bigl[E_X f_k(X,Y)\bigr]^{2},</math>
<math display="block">\theta=2\,E\Bigl[E\bigl(\delta(X_1,X_2,X)\,(1-\delta(X_1,X_2,Y))\mid X_1,X_2\bigr)\Bigr].</math>

5. Asymptotic distribution under the null hypothesis: Suppose that <math>n,m\to\infty</math> in such a way that <math>\tfrac{n}{n+m}\to\tau</math>, <math>0\le\tau\le 1</math>. Under the null hypothesis, we have
<math display="block">\frac{nm}{n+m}\operatorname{BD}_{n,m}\ \xrightarrow{d}\ \sum_{k=1}^{\infty}2\lambda_k\Bigl[\bigl(a_k(\tau)Z_{1k}+b_k(\tau)Z_{2k}\bigr)^{2}-\bigl(a_k^{2}(\tau)+b_k^{2}(\tau)\bigr)\Bigr]+\theta.</math>

6. Distribution under the alternative hypothesis: Let <math>\delta_{1,0}^{2}=\operatorname{Var}\bigl(g^{(1,0)}(X)\bigr)</math> and <math>\delta_{0,1}^{2}=\operatorname{Var}\bigl(g^{(0,1)}(Y)\bigr)</math>. Suppose that <math>n,m\to\infty</math> in such a way that <math>\tfrac{n}{n+m}\to\tau</math>, <math>0\le\tau\le 1</math>. Under the alternative hypothesis, we have
<math display="block">\sqrt{\frac{nm}{n+m}}\bigl(\operatorname{BD}_{n,m}-\operatorname{BD}(\mu,\nu)\bigr)\ \xrightarrow{d}\ N\bigl(0,\ (1-\tau)\delta_{1,0}^{2}+\tau\,\delta_{0,1}^{2}\bigr).</math>

7. The test based on <math>\operatorname{BD}_{n,m}</math> is consistent against any general alternative <math>H_1</math>. More specifically, <math>\lim_{n\to\infty}\operatorname{Var}_{H_1}(\operatorname{BD}_{n,m})=0</math> and <math>\Delta(\eta):=\liminf_{n\to\infty}\bigl(E_{H_1}\operatorname{BD}_{n,m}-E_{H_0}\operatorname{BD}_{n,m}\bigr)>0</math>. More importantly, <math>\Delta(\eta)</math> can also be expressed as <math>\Delta(\eta)=\operatorname{BD}(\mu,\nu)</math>, which is independent of <math>\eta</math>.
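Because the limiting null distribution above depends on unknown eigenvalues, one practical way to calibrate such a statistic is a permutation test. The following is a minimal sketch, reusing the hypothetical sample_ball_divergence function from the earlier sketch; the permutation scheme shown here is a generic recipe and is not asserted to be the exact procedure used in the cited references.

<syntaxhighlight lang="python">
import numpy as np

def ball_divergence_permutation_test(X, Y, n_perm=199, seed=None):
    """Two-sample test of H0: mu = nu based on BD_{n,m}.

    The p-value is the (add-one corrected) fraction of permuted
    statistics that are at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    observed = sample_ball_divergence(X, Y)

    pooled = np.vstack([X, Y])
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        Xp, Yp = pooled[idx[:n]], pooled[idx[n:]]
        if sample_ball_divergence(Xp, Yp) >= observed:
            exceed += 1

    return (exceed + 1) / (n_perm + 1)
</syntaxhighlight>

Large values of <math>\operatorname{BD}_{n,m}</math> indicate a difference between the two distributions, so the null hypothesis is rejected when the permutation p-value falls below the chosen significance level.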

References

Template:Reflist