Ball divergence

Ball Divergence (BD) is a nonparametric measure of the difference between two probability distributions.[1] It was introduced to address shortcomings of traditional methods for comparing distributions, particularly for high-dimensional, non-normal, or imbalanced data. Unlike classical tests such as Student's t-test or Hotelling's T² test, which typically require assumptions about the data (e.g., normality), Ball Divergence does not rely on any specific distributional assumptions. This makes it especially useful when those assumptions fail, for example in the presence of outliers or heavy-tailed distributions.

Background

In statistics, testing whether two multivariate samples are drawn from the same unknown distribution is an important and challenging task. This comparison is essential in fields such as hypothesis testing, machine learning, bioinformatics, and environmental studies. Traditionally, it has been handled with parametric methods such as Student's t-test or Hotelling's T² test, which typically assume that the data satisfy conditions such as normality, homogeneity of variances, or independence of the samples. In practice, however, these assumptions often do not hold, particularly when the data are high-dimensional, contain outliers, or have heavy tails. In such situations, traditional tests may fail to detect meaningful differences between the distributions, leading to incorrect conclusions.

A widely used nonparametric two-sample test is the energy distance test.[2] However, its effectiveness relies on moment conditions, which makes it less effective for extremely imbalanced data (where one sample size is disproportionately larger than the other). To address this issue, Chen, Dou, and Qiao proposed a nonparametric multivariate test using ensemble subsampling nearest neighbors (ESS-NN) for imbalanced data.[3] This method handles imbalanced data effectively and increases the test's power by fixing the size of the smaller group while increasing the size of the larger group.

Additionally, Gretton et al. introduced the maximum mean discrepancy (MMD) for the two-sample problem.[4] Both methods require additional tuning parameters, such as the number of groups k in ESS-NN and the kernel function in MMD. Ball Divergence addresses the two-sample test problem for extremely imbalanced samples without introducing such parameters.

Definition

The formal definition of Ball Divergence involves integrating the squared difference between two probability measures over a family of closed balls in a Banach space. This is achieved by first defining a metric (or distance function) within the space, which allows us to measure the distance between points. A closed ball around a point u is simply the set of all points that are within a fixed distance r from u, where r is the radius of the ball.
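In symbols, writing <math>\rho</math> for the chosen metric on the space <math>V</math>, the closed ball of radius <math>r</math> around <math>u</math> is
<math display="block">\bar{B}(u,r)=\{\,z\in V:\ \rho(z,u)\le r\,\}.</math>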

The Ball Divergence formula is given as follows:
<math display="block">\operatorname{BD}(\mu,\nu)=\iint_{V\times V}\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\bigl(\mu(du)\,\mu(dv)+\nu(du)\,\nu(dv)\bigr),</math>
where:

  • μ and ν are the probability measures being compared.
  • <math>\bar{B}(u,\rho(u,v))</math> represents a closed ball in the space, centered at u, with radius equal to the distance between the points u and v as measured by the metric ρ.
  • The integral is taken over all possible pairs of points, summing the squared differences of the two measures over all such balls.
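Here <math>[\mu-\nu]^{2}(\cdot)</math> denotes the squared difference of the two measures evaluated on the same ball, i.e.
<math display="block">\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)=\Bigl(\mu\bigl(\bar{B}(u,\rho(u,v))\bigr)-\nu\bigl(\bar{B}(u,\rho(u,v))\bigr)\Bigr)^{2}.</math>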

This measure allows for a detailed, scale-sensitive comparison between the two distributions. The integral captures global differences between the distributions, but because it is defined over balls, the comparison is inherently local as well, making it robust to variations in the data and more sensitive to local differences than traditional nonparametric methods.

Testing for equal distributions

Next we give a sample version of Ball Divergence. For convenience, decompose the Ball Divergence into two parts:
<math display="block">A=\iint_{V\times V}\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\mu(du)\,\mu(dv)</math>
and
<math display="block">C=\iint_{V\times V}\left[\mu-\nu\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\nu(du)\,\nu(dv),</math>
so that <math>\operatorname{BD}(\mu,\nu)=A+C</math>.

Let <math>\delta(x,y,z)=I\bigl(z\in\bar{B}(x,\rho(x,y))\bigr)</math> indicate whether the point z lies in the ball <math>\bar{B}(x,\rho(x,y))</math>. Given two independent samples <math>\{X_1,\ldots,X_n\}</math> from <math>\mu</math> and <math>\{Y_1,\ldots,Y_m\}</math> from <math>\nu</math>, define

<math display="block">A^{X}_{ij}=\frac{1}{n}\sum_{u=1}^{n}\delta(X_i,X_j,X_u),\qquad A^{Y}_{ij}=\frac{1}{m}\sum_{v=1}^{m}\delta(X_i,X_j,Y_v),</math>
<math display="block">C^{X}_{kl}=\frac{1}{n}\sum_{u=1}^{n}\delta(Y_k,Y_l,X_u),\qquad C^{Y}_{kl}=\frac{1}{m}\sum_{v=1}^{m}\delta(Y_k,Y_l,Y_v),</math>
where <math>A^{X}_{ij}</math> is the proportion of the sample from the probability measure <math>\mu</math> located in the ball <math>\bar{B}(X_i,\rho(X_i,X_j))</math> and <math>A^{Y}_{ij}</math> is the proportion of the sample from the probability measure <math>\nu</math> located in that same ball. Likewise, <math>C^{X}_{kl}</math> and <math>C^{Y}_{kl}</math> are the proportions of the samples from <math>\mu</math> and <math>\nu</math>, respectively, located in the ball <math>\bar{B}(Y_k,\rho(Y_k,Y_l))</math>. The sample versions of A and C are as follows

<math display="block">A_{n,m}=\frac{1}{n^{2}}\sum_{i,j=1}^{n}\left(A^{X}_{ij}-A^{Y}_{ij}\right)^{2},\qquad C_{n,m}=\frac{1}{m^{2}}\sum_{k,l=1}^{m}\left(C^{X}_{kl}-C^{Y}_{kl}\right)^{2}.</math>
Finally, we can give the sample ball divergence

<math display="block">\operatorname{BD}_{n,m}=A_{n,m}+C_{n,m}.</math>
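As an illustration, the following is a minimal sketch of how <math>\operatorname{BD}_{n,m}</math> could be computed directly from the definitions above, assuming the Euclidean metric and using SciPy for pairwise distances; the function name sample_ball_divergence is a placeholder introduced here, not an official implementation.

<syntaxhighlight lang="python">
from scipy.spatial.distance import cdist

def sample_ball_divergence(X, Y):
    """Sample Ball Divergence BD_{n,m} for X ~ mu (n x d) and Y ~ nu (m x d)."""
    # Pairwise distances used to decide ball membership (Euclidean metric).
    dXX = cdist(X, X)   # rho(X_i, X_u)
    dXY = cdist(X, Y)   # rho(X_i, Y_v)
    dYY = cdist(Y, Y)   # rho(Y_k, Y_v)
    dYX = cdist(Y, X)   # rho(Y_k, X_u)

    # A^X_{ij}, A^Y_{ij}: proportions of each sample inside the closed ball
    # centred at X_i with radius rho(X_i, X_j).
    radii_X = dXX[:, :, None]                        # shape (n, n, 1)
    A_X = (dXX[:, None, :] <= radii_X).mean(axis=2)  # mean of delta(X_i, X_j, X_u)
    A_Y = (dXY[:, None, :] <= radii_X).mean(axis=2)  # mean of delta(X_i, X_j, Y_v)

    # C^X_{kl}, C^Y_{kl}: the same proportions for balls centred at Y_k
    # with radius rho(Y_k, Y_l).
    radii_Y = dYY[:, :, None]
    C_X = (dYX[:, None, :] <= radii_Y).mean(axis=2)
    C_Y = (dYY[:, None, :] <= radii_Y).mean(axis=2)

    A_nm = ((A_X - A_Y) ** 2).mean()  # (1/n^2) sum over i, j
    C_nm = ((C_X - C_Y) ** 2).mean()  # (1/m^2) sum over k, l
    return A_nm + C_nm
</syntaxhighlight>

The broadcasting builds an indicator array over all balls and all sample points, so memory grows cubically in the sample sizes; this is acceptable for an illustration but not for large samples.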

Properties

1. <math>\operatorname{BD}(\mu,\nu)\ge 0</math>, where the equality holds if and only if <math>\mu=\nu</math>.

2. Ball Divergence is symmetric in its arguments, but its square root does not satisfy the triangle inequality, so it is a divergence rather than a metric.

3. BD can be generalized to the K-sample problem. Suppose that <math>\mu_1,\ldots,\mu_K</math> are K probability measures on a Banach space. Define

<math display="block">D(\mu_1,\ldots,\mu_K)=\sum_{1\le l<k\le K}\iint_{V\times V}\left[\mu_k-\mu_l\right]^{2}\!\left(\bar{B}(u,\rho(u,v))\right)\bigl(\mu_k(du)\,\mu_k(dv)+\mu_l(du)\,\mu_l(dv)\bigr).</math>

Clearly, <math>D(\mu_1,\ldots,\mu_K)=0</math> if and only if <math>\mu_1=\cdots=\mu_K</math>.
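A corresponding sample statistic can be sketched by summing the two-sample statistic over all pairs of groups, mirroring the definition above; this reuses the hypothetical sample_ball_divergence function from the earlier sketch and is an illustrative assumption rather than a procedure taken from the cited references.

<syntaxhighlight lang="python">
from itertools import combinations

def k_sample_ball_divergence(samples):
    """Sum of pairwise sample Ball Divergences over all pairs of groups.

    samples : list of (n_i, d) arrays, one array per group.
    Mirrors D(mu_1, ..., mu_K) as a sum of two-sample divergences.
    """
    return sum(sample_ball_divergence(S, T)
               for S, T in combinations(samples, 2))
</syntaxhighlight>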


4. Consistency: We have

<math display="block">\operatorname{BD}_{n,m}\ \xrightarrow[\;n,m\to\infty\;]{\text{a.s.}}\ \operatorname{BD}(\mu,\nu),</math>
where <math>\tfrac{n}{n+m}\to\tau</math> for some <math>\tau\in[0,1]</math>.

Define <math>\xi(x,y,z_1,z_2)=\delta(x,y,z_1)\,\delta(x,y,z_2)</math>, and then let
<math display="block">Q(x,y;x',y')=\phi^{(2,0)}_{A}(x,x')+\phi^{(1,1)}_{A}(x,y')+\phi^{(1,1)}_{A}(x',y)+\phi^{(0,2)}_{A}(y,y'),</math>
where

<math display="block">\begin{aligned}
\phi^{(2,0)}_{A}(x,x') &= E[\xi(X_1,X_2,x,x')]+E[\xi(X_1,X_2,Y,Y_3)]-E[\xi(X_1,X_2,x,Y)]-E[\xi(X_1,X_2,x',Y_3)],\\
\phi^{(1,1)}_{A}(x,y) &= E[\xi(X_1,X_2,x,X_3)]+E[\xi(X_1,X_2,y,Y_3)]-E[\xi(X_1,X_2,x,y)]-E[\xi(X_1,X_2,X_3,Y_3)],\\
\phi^{(0,2)}_{A}(y,y') &= E[\xi(X_1,X_2,X,X_3)]+E[\xi(X_1,X_2,y,y')]-E[\xi(X_1,X_2,X,y)]-E[\xi(X_1,X_2,X_3,y')].
\end{aligned}</math>
The function <math>Q(x,y;x',y')</math> has the spectral decomposition
<math display="block">Q(x,y;x',y')=\sum_{k=1}^{\infty}\lambda_k f_k(x,y)f_k(x',y'),</math>
where <math>\lambda_k</math> and <math>f_k</math> are the eigenvalues and eigenfunctions of <math>Q</math>. For <math>k=1,2,\ldots</math>, let <math>Z_{1k},Z_{2k}</math> be i.i.d. <math>N(0,1)</math>, and set
<math display="block">a_k^{2}(\tau)=(1-\tau)\,E_X\bigl[E_Y f_k(X,Y)\bigr]^{2},\qquad b_k^{2}(\tau)=\tau\,E_Y\bigl[E_X f_k(X,Y)\bigr]^{2},</math>
<math display="block">\theta=2\,E\Bigl[E\bigl(\delta(X_1,X_2,X)\,(1-\delta(X_1,X_2,Y))\mid X_1,X_2\bigr)\Bigr].</math>

5. Asymptotic distribution under the null hypothesis: Suppose that <math>n,m\to\infty</math> in such a way that <math>\tfrac{n}{n+m}\to\tau</math>, <math>0\le\tau\le 1</math>. Under the null hypothesis, we have
<math display="block">\frac{nm}{n+m}\operatorname{BD}_{n,m}\ \xrightarrow{d}\ \sum_{k=1}^{\infty}2\lambda_k\Bigl[\bigl(a_k(\tau)Z_{1k}+b_k(\tau)Z_{2k}\bigr)^{2}-\bigl(a_k^{2}(\tau)+b_k^{2}(\tau)\bigr)\Bigr]+\theta.</math>

6. Distribution under the alternative hypothesis: Let <math>\delta_{1,0}^{2}=\operatorname{Var}\bigl(g^{(1,0)}(X)\bigr)</math> and <math>\delta_{0,1}^{2}=\operatorname{Var}\bigl(g^{(0,1)}(Y)\bigr)</math>. Suppose that <math>n,m\to\infty</math> in such a way that <math>\tfrac{n}{n+m}\to\tau</math>, <math>0\le\tau\le 1</math>. Under the alternative hypothesis, we have
<math display="block">\sqrt{\frac{nm}{n+m}}\bigl(\operatorname{BD}_{n,m}-\operatorname{BD}(\mu,\nu)\bigr)\ \xrightarrow{d}\ N\bigl(0,\ (1-\tau)\delta_{1,0}^{2}+\tau\,\delta_{0,1}^{2}\bigr).</math>

7. The test based on <math>\operatorname{BD}_{n,m}</math> is consistent against any general alternative <math>H_1</math>. More specifically, <math>\lim_{n\to\infty}\operatorname{Var}_{H_1}(\operatorname{BD}_{n,m})=0</math> and <math>\Delta(\eta):=\liminf_{n\to\infty}\bigl(E_{H_1}\operatorname{BD}_{n,m}-E_{H_0}\operatorname{BD}_{n,m}\bigr)>0</math>. More importantly, <math>\Delta(\eta)</math> can also be expressed as <math>\Delta(\eta)=\operatorname{BD}(\mu,\nu)</math>, which is independent of <math>\eta</math>.
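Because the limiting null distribution above depends on unknown eigenvalues, one practical way to calibrate such a statistic is a permutation test. The following is a minimal sketch, reusing the hypothetical sample_ball_divergence function from the earlier sketch; the permutation scheme shown here is a generic recipe and is not asserted to be the exact procedure used in the cited references.

<syntaxhighlight lang="python">
import numpy as np

def ball_divergence_permutation_test(X, Y, n_perm=199, seed=None):
    """Two-sample test of H0: mu = nu based on BD_{n,m}.

    The p-value is the (add-one corrected) fraction of permuted
    statistics that are at least as large as the observed one.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    observed = sample_ball_divergence(X, Y)

    pooled = np.vstack([X, Y])
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        Xp, Yp = pooled[idx[:n]], pooled[idx[n:]]
        if sample_ball_divergence(Xp, Yp) >= observed:
            exceed += 1

    return (exceed + 1) / (n_perm + 1)
</syntaxhighlight>

Large values of <math>\operatorname{BD}_{n,m}</math> indicate a difference between the two distributions, so the null hypothesis is rejected when the permutation p-value falls below the chosen significance level.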

References

Template:Reflist