Ball divergence
Ball Divergence (BD) is a statistical measure of the difference between two probability distributions.[1] It was introduced to address the shortcomings of traditional methods for comparing distributions, particularly with high-dimensional, non-normal, or imbalanced datasets. Unlike classical tests such as Student's t-test or Hotelling's T² test, which typically require assumptions about the data (e.g., normality), Ball Divergence is a nonparametric measure, meaning it does not rely on any specific assumptions about the distribution of the data. This makes it especially useful when such assumptions fail, for example in the presence of outliers or heavy-tailed distributions.
Background
In statistics, deciding whether two multivariate samples come from the same distribution is an important and challenging task. This comparison arises in fields such as hypothesis testing, machine learning, bioinformatics, and environmental studies. Traditionally, it has been handled with parametric methods such as Student's t-test or Hotelling's T² test, which assume that the data satisfy certain conditions, such as normality, homogeneity of variances, or independence of samples. In practice, these assumptions often do not hold, particularly when the data are high-dimensional, contain outliers, or have heavy tails. In such situations, traditional tests may fail to detect meaningful differences between the distributions, leading to incorrect conclusions.
Before Ball Divergence, a common nonparametric two-sample test was the energy distance test.[2] However, the effectiveness of the energy distance test relies on moment conditions, making it less effective for extremely imbalanced data (where one sample size is disproportionately larger than the other). To address this issue, Chen, Dou, and Qiao proposed a nonparametric multivariate test using ensemble subsampling nearest neighbors (ESS-NN) for imbalanced data.[3] This method handles imbalanced data effectively and increases the test's power by fixing the size of the smaller group while increasing the size of the larger group.
Additionally, Gretton et al. introduced the maximum mean discrepancy (MMD) for the two-sample problem.[4] Both methods require additional tuning parameters, such as the number of groups $k$ in ESS-NN and the kernel function in MMD. Ball Divergence addresses the two-sample test problem for extremely imbalanced samples without introducing additional parameters.
Definition
The formal definition of Ball Divergence involves integrating the squared difference between two probability measures over a family of closed balls in a Banach space. This is achieved by first defining a metric (or distance function) on the space, which allows us to measure the distance between points. A closed ball $\bar{B}(x, r)$ around a point $x$ is simply the set of all points whose distance from $x$ is at most $r$, where $r$ is the radius of the ball.
The Ball Divergence between two Borel probability measures $\mu$ and $\nu$ on a Banach space $(V, \|\cdot\|)$ is given as follows:

$\mathrm{BD}(\mu, \nu) = \iint_{V \times V} \left[\mu\!\left(\bar{B}(u, \rho(u, v))\right) - \nu\!\left(\bar{B}(u, \rho(u, v))\right)\right]^2 \left[\mu(du)\,\mu(dv) + \nu(du)\,\nu(dv)\right],$

where:
- $\mu$ and $\nu$ are the probability measures being compared.
- $\bar{B}(u, \rho(u, v))$ represents a closed ball in the space, centered at $u$, with radius $\rho(u, v) = \|u - v\|$ determined by the distance between the points $u$ and $v$ as measured by the norm $\|\cdot\|$.
- The integral is taken over all possible pairs of points (drawn from $\mu \times \mu$ and from $\nu \times \nu$), summing the squared differences of the two measures over all such balls.
This measure allows for a detailed, scale-sensitive comparison between the two distributions. The integral captures global differences between the distributions, but because it is defined over balls of every radius, the comparison is inherently local as well, making it robust to variations in the data and more sensitive to local differences than many traditional nonparametric methods.
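As a simple illustration (a direct evaluation of the definition above, not an example from the original reference), take $V = \mathbb{R}$ with the usual norm and let $\mu = \delta_0$ and $\nu = \delta_1$ be point masses at 0 and 1. When both $u$ and $v$ are drawn from $\mu$, we have $u = v = 0$, so the only ball that appears is $\bar{B}(0, 0) = \{0\}$, with $\mu(\{0\}) = 1$ and $\nu(\{0\}) = 0$; when both are drawn from $\nu$, the ball is $\{1\}$, with $\mu(\{1\}) = 0$ and $\nu(\{1\}) = 1$. Hence

$\mathrm{BD}(\delta_0, \delta_1) = (1 - 0)^2 + (0 - 1)^2 = 2,$

whereas $\mathrm{BD}(\mu, \mu) = 0$ for any $\mu$, in line with the nonnegativity property listed below.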
Testing for equal distributions
Next, we give a sample version of Ball Divergence. For convenience, the Ball Divergence can be decomposed into two parts:

$A = \iint_{V \times V} \left[\mu\!\left(\bar{B}(u, \rho(u, v))\right) - \nu\!\left(\bar{B}(u, \rho(u, v))\right)\right]^2 \mu(du)\,\mu(dv)$

and

$C = \iint_{V \times V} \left[\mu\!\left(\bar{B}(u, \rho(u, v))\right) - \nu\!\left(\bar{B}(u, \rho(u, v))\right)\right]^2 \nu(du)\,\nu(dv).$

Thus

$\mathrm{BD}(\mu, \nu) = A + C.$
Let $\delta(x, y, z) = I\!\left(z \in \bar{B}(x, \rho(x, y))\right)$ denote whether the point $z$ lies in the ball $\bar{B}(x, \rho(x, y))$. Given two independent samples $X_1, \ldots, X_n$ from $\mu$ and $Y_1, \ldots, Y_m$ from $\nu$, define

$A^X_{ij} = \frac{1}{n}\sum_{u=1}^{n} \delta(X_i, X_j, X_u), \qquad A^Y_{ij} = \frac{1}{m}\sum_{v=1}^{m} \delta(X_i, X_j, Y_v),$

where $A^X_{ij}$ means the proportion of the sample from the probability measure $\mu$ located in the ball $\bar{B}(X_i, \rho(X_i, X_j))$ and $A^Y_{ij}$ means the proportion of the sample from the probability measure $\nu$ located in the ball $\bar{B}(X_i, \rho(X_i, X_j))$. Meanwhile,

$C^X_{kl} = \frac{1}{n}\sum_{u=1}^{n} \delta(Y_k, Y_l, X_u) \qquad \text{and} \qquad C^Y_{kl} = \frac{1}{m}\sum_{v=1}^{m} \delta(Y_k, Y_l, Y_v)$

mean the proportions of the samples from the probability measures $\mu$ and $\nu$ located in the ball $\bar{B}(Y_k, \rho(Y_k, Y_l))$. The sample versions of $A$ and $C$ are as follows:

$A_{n,m} = \frac{1}{n^2}\sum_{i,j=1}^{n}\left(A^X_{ij} - A^Y_{ij}\right)^2, \qquad C_{n,m} = \frac{1}{m^2}\sum_{k,l=1}^{m}\left(C^X_{kl} - C^Y_{kl}\right)^2.$

Finally, we can give the sample Ball Divergence

$D_{n,m} = A_{n,m} + C_{n,m}.$
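The sample statistic $D_{n,m}$ can be computed directly from pairwise distances. The following is a minimal sketch in Python/NumPy; the function name, the use of the Euclidean norm for $\rho$, and the brute-force vectorization are choices made here for illustration and are not taken from the reference implementation (an implementation is also available in the R package Ball).

import numpy as np

def sample_ball_divergence(X, Y):
    """Minimal sketch of the sample Ball Divergence D_{n,m}.

    X : (n, d) array of observations drawn from mu.
    Y : (m, d) array of observations drawn from nu.
    The Euclidean norm is used as the metric rho.
    """
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)

    def half(S, T):
        # Ball (i, j) is centered at S[i] with radius rho(S[i], S[j]).
        d_SS = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)  # (|S|, |S|)
        d_ST = np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)  # (|S|, |T|)
        # Proportions of S-points and T-points falling in each ball
        # (these are A^X_{ij} and A^Y_{ij} in the text when S = X, T = Y).
        prop_S = (d_SS[:, None, :] <= d_SS[:, :, None]).mean(axis=-1)
        prop_T = (d_ST[:, None, :] <= d_SS[:, :, None]).mean(axis=-1)
        # Average of squared differences over all balls.
        return ((prop_S - prop_T) ** 2).mean()

    # D_{n,m} = A_{n,m} + C_{n,m}
    return half(X, Y) + half(Y, X)

# Example: two Gaussian samples with shifted means.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 3))
Y = rng.normal(0.5, 1.0, size=(60, 3))
print(sample_ball_divergence(X, Y))

The brute-force version above costs on the order of $(n^2 + m^2)(n + m)$ operations; for large samples, sorting the distances from each center allows the ball counts to be obtained without forming the cubic-size boolean arrays.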
Properties
1. $\mathrm{BD}(\mu, \nu) \geq 0$, where the equality holds if and only if $\mu = \nu$.
2. Ball Divergence is symmetric in $\mu$ and $\nu$, but its square root does not satisfy the triangle inequality, so it is a symmetric divergence rather than a metric.
3. BD can be generalized to the K-sample problem. Suppose that $\mu_1, \ldots, \mu_K$ are measures on a separable Banach space. We can define

$D(\mu_1, \ldots, \mu_K) = \sum_{1 \le s < t \le K} \mathrm{BD}(\mu_s, \mu_t).$

Clearly, $D(\mu_1, \ldots, \mu_K) = 0$ if and only if $\mu_1 = \cdots = \mu_K$.
4. Consistency: we have

$D_{n,m} \xrightarrow{a.s.} \mathrm{BD}(\mu, \nu),$

where $\frac{n}{n+m} \to \tau$ for some $\tau \in [0, 1]$.
Define $\delta(x, y, z) = I\!\left(z \in \bar{B}(x, \rho(x, y))\right)$ as above, and let $g(x, y)$ denote the kernel of the degenerate part of $D_{n,m}$ under the null hypothesis. The function $g$ has the spectral decomposition

$g(x, y) = \sum_{k=1}^{\infty} \lambda_k f_k(x) f_k(y),$

where $\lambda_k$ and $f_k$ are the eigenvalues and eigenfunctions of $g$. For $k = 1, 2, \ldots$, let $Z_k$ be i.i.d. standard normal random variables.
5. Asymptotic distribution under the null hypothesis: suppose that both $n \to \infty$ and $m \to \infty$ in such a way that $\frac{n}{n+m} \to \tau \in (0, 1)$. Under the null hypothesis, we have

$\frac{nm}{n+m} D_{n,m} \xrightarrow{d} \sum_{k=1}^{\infty} \lambda_k Z_k^2.$

Since the eigenvalues $\lambda_k$ are generally unknown, the null distribution is usually approximated by resampling in practice (see the sketch after this list).
6. Asymptotic distribution under the alternative hypothesis: suppose that both $n \to \infty$ and $m \to \infty$ in such a way that $\frac{n}{n+m} \to \tau \in (0, 1)$. Under the alternative hypothesis, we have

$\sqrt{\frac{nm}{n+m}}\left(D_{n,m} - \mathrm{BD}(\mu, \nu)\right) \xrightarrow{d} N(0, \sigma^2)$

for some asymptotic variance $\sigma^2 > 0$ depending on $\mu$, $\nu$, and $\tau$.
7. The test that rejects for large values of $D_{n,m}$ is consistent against any general alternative $H_1: \mu \neq \nu$; that is, its power tends to one as $n, m \to \infty$.
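Because the eigenvalues $\lambda_k$ in the null limit are unknown in practice, a generic way to calibrate the two-sample test is a permutation procedure: pool the samples, repeatedly relabel them at random, and compare the observed statistic with the permuted ones. The sketch below reuses the sample_ball_divergence function from the earlier example; the permutation scheme is a standard recipe and is not a procedure spelled out in this article.

import numpy as np

def ball_divergence_permutation_test(X, Y, num_permutations=199, seed=0):
    """Approximate p-value for H0: mu = nu using the sample Ball Divergence.

    Reuses sample_ball_divergence(X, Y) defined in the sketch above.
    """
    rng = np.random.default_rng(seed)
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    n = len(X)

    observed = sample_ball_divergence(X, Y)
    pooled = np.vstack([X, Y])

    exceed = 0
    for _ in range(num_permutations):
        perm = rng.permutation(len(pooled))
        X_star, Y_star = pooled[perm[:n]], pooled[perm[n:]]
        if sample_ball_divergence(X_star, Y_star) >= observed:
            exceed += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (num_permutations + 1)

Rejecting when the returned p-value falls below the chosen significance level gives an exact-level test under the null hypothesis, since relabeling pooled observations does not change their joint distribution when $\mu = \nu$.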