Distributional data analysis

Distributional data analysis is a branch of nonparametric statistics that is related to functional data analysis. It is concerned with random objects that are probability distributions, i.e., the statistical analysis of samples of random distributions in which each observation is a distribution. One of the main challenges in distributional data analysis is that, although the space of probability distributions is convex, it is not a vector space.

Notation

Let $\nu$ be a probability measure on $D$, where $D \subset \mathbb{R}^p$ with $p \ge 1$. The probability measure $\nu$ can be equivalently characterized by its cumulative distribution function $F$ or by its probability density function $f$, if it exists. For univariate distributions with $p = 1$, the quantile function $Q = F^{-1}$ can also be used.

Let $\mathcal{F}$ be a space of distributions $\nu$ and let $d$ be a metric on $\mathcal{F}$, so that $(\mathcal{F}, d)$ forms a metric space. There are various metrics available for $d$.[1] For example, suppose $\nu_1, \nu_2 \in \mathcal{F}$, and let $f_1$ and $f_2$ be the density functions of $\nu_1$ and $\nu_2$, respectively. The Fisher–Rao metric is defined as
$$d_{FR}(f_1, f_2) = \arccos\left( \int_D \sqrt{f_1(x) f_2(x)}\, dx \right).$$
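As a minimal numerical sketch (the grid and densities below are illustrative choices, not part of any reference implementation), the Fisher–Rao distance can be approximated from densities sampled on a common grid via the trapezoidal rule:

```python
import numpy as np

# Fisher-Rao distance d_FR(f1, f2) = arccos( integral of sqrt(f1 * f2) )
# between two densities sampled on a common grid x.
def fisher_rao_distance(f1, f2, x):
    bc = np.trapz(np.sqrt(f1 * f2), x)        # Bhattacharyya coefficient in [0, 1]
    return np.arccos(np.clip(bc, -1.0, 1.0))  # clip guards floating-point rounding

x = np.linspace(-10.0, 10.0, 2001)
f1 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)        # N(0, 1) density
f2 = np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)  # N(1, 1) density
d12 = fisher_rao_distance(f1, f2, x)               # positive distance
d11 = fisher_rao_distance(f1, f1, x)               # distance to itself, near 0
```

For two unit-variance normals the Bhattacharyya coefficient has the closed form $\exp(-(m_1 - m_2)^2/8)$, which gives a convenient check of the numerics.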

For univariate distributions, let $Q_1$ and $Q_2$ be the quantile functions of $\nu_1$ and $\nu_2$. Denote by $\mathcal{W}_p$ the $L^p$-Wasserstein space, the space of distributions with finite $p$-th moments. Then, for $\nu_1, \nu_2 \in \mathcal{W}_p$, the $L^p$-Wasserstein metric is defined as
$$d_{W_p}(\nu_1, \nu_2) = \left( \int_0^1 \left| Q_1(s) - Q_2(s) \right|^p ds \right)^{1/p}.$$
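This quantile-function representation makes the univariate Wasserstein metric easy to discretize; the following sketch (the example distributions are illustrative) evaluates it on a grid over $(0, 1)$:

```python
import numpy as np

# L^p-Wasserstein distance of two univariate distributions from their quantile
# functions, discretizing ( integral_0^1 |Q1(s) - Q2(s)|^p ds )^(1/p).
def wasserstein_distance(Q1, Q2, p=2, n=10001):
    s = np.linspace(0.0, 1.0, n)
    return np.trapz(np.abs(Q1(s) - Q2(s))**p, s)**(1.0 / p)

# Uniform(0,1) and Uniform(1,2) have quantile functions s and 1 + s,
# so their L2-Wasserstein distance is exactly 1.
d = wasserstein_distance(lambda s: s, lambda s: 1.0 + s)
```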

Mean and variance

For a probability measure $\nu \in \mathcal{F}$, consider a random process $\mathfrak{F}$ such that $\nu \sim \mathfrak{F}$. One way to define the mean and variance of $\nu$ is through the Fréchet mean and the Fréchet variance. With respect to the metric $d$ on $\mathcal{F}$, the Fréchet mean $\mu$, also known as the barycenter, and the Fréchet variance $V$ are defined as[2]
$$\mu = \operatorname*{argmin}_{\omega \in \mathcal{F}} \mathbb{E}\left[ d^2(\nu, \omega) \right], \qquad V = \mathbb{E}\left[ d^2(\nu, \mu) \right].$$

A widely used example is the Wasserstein–Fréchet mean, or simply the Wasserstein mean, which is the Fréchet mean with respect to the $L^2$-Wasserstein metric $d_{W_2}$.[3] For $\nu, \mu \in \mathcal{W}_2$, let $Q_\nu, Q_\mu$ be the quantile functions of $\nu$ and $\mu$, respectively. The Wasserstein mean and Wasserstein variance are defined as
$$\mu_* = \operatorname*{argmin}_{\mu \in \mathcal{W}_2} \mathbb{E}\left[ \int_0^1 (Q_\nu(s) - Q_\mu(s))^2\, ds \right], \qquad V_* = \mathbb{E}\left[ \int_0^1 (Q_\nu(s) - Q_{\mu_*}(s))^2\, ds \right].$$
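For a finite sample of univariate distributions, the quantile function of the Wasserstein mean is the pointwise average of the sample quantile functions, and the sample Wasserstein variance is the average squared $L^2$ distance to it. A short sketch (the sample of two uniform distributions is illustrative):

```python
import numpy as np

# Sample Wasserstein mean and variance from quantile functions on a grid s.
def wasserstein_mean_variance(Q_sample, s):
    Q_bar = Q_sample.mean(axis=0)   # quantile function of the Wasserstein mean
    V = np.mean([np.trapz((Q - Q_bar)**2, s) for Q in Q_sample])
    return Q_bar, V

s = np.linspace(0.0, 1.0, 101)
Q = np.vstack([s, 2.0 + s])        # quantiles of Uniform(0,1) and Uniform(2,3)
Q_bar, V = wasserstein_mean_variance(Q, s)
# The mean quantile function is 1 + s (i.e., Uniform(1,2)), and V = 1.
```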

Modes of variation

Modes of variation are useful concepts for depicting the variation of data around the mean function. Based on the Karhunen–Loève representation, modes of variation show the contribution of each eigenfunction to the variation around the mean.

Functional principal component analysis

Functional principal component analysis (FPCA) can be applied directly to the probability density functions.[4] Consider a distribution process $\nu \sim \mathfrak{F}$ and let $f$ be the density function of $\nu$. Denote the mean density function by $\mu(t) = \mathbb{E}[f(t)]$ and the covariance function by $G(s,t) = \operatorname{Cov}(f(s), f(t))$, with orthonormal eigenfunctions $\{\phi_j\}_{j=1}^{\infty}$ and eigenvalues $\{\lambda_j\}_{j=1}^{\infty}$.

By the Karhunen–Loève theorem, $f(t) = \mu(t) + \sum_{j=1}^{\infty} \xi_j \phi_j(t)$, where the principal components are $\xi_j = \int_D [f(t) - \mu(t)] \phi_j(t)\, dt$. The $j$th mode of variation is defined as
$$g_j(t, \alpha) = \mu(t) + \alpha \sqrt{\lambda_j}\, \phi_j(t), \quad t \in D,\; \alpha \in [-A, A],$$
for some constant $A$, such as 2 or 3.
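A small sketch of this construction on a grid (the sample of Gaussian densities, the grid, and the discretization are illustrative assumptions): estimate the mean density and covariance surface from sampled density curves, extract eigenpairs of the discretized covariance operator, and form the $j$th mode of variation.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-5.0, 5.0, 201)
dt = t[1] - t[0]

# Toy distributional sample: Gaussian densities with random location parameters.
means = rng.normal(0.0, 0.5, size=50)
F = np.array([np.exp(-(t - m)**2 / 2) / np.sqrt(2 * np.pi) for m in means])

mu = F.mean(axis=0)                  # mean density function mu(t)
G = np.cov(F.T)                      # covariance surface G(s, t) on the grid
vals, vecs = np.linalg.eigh(G * dt)  # discretized covariance operator
lam = vals[::-1]                     # eigenvalues, descending
phi = vecs[:, ::-1] / np.sqrt(dt)    # eigenfunctions, L2-normalized on the grid

# j-th mode of variation g_j(t, a) = mu(t) + a * sqrt(lambda_j) * phi_j(t)
def mode_of_variation(j, a):
    return mu + a * np.sqrt(lam[j]) * phi[:, j]

g = mode_of_variation(0, 2.0)        # first mode at alpha = 2
```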

Transformation FPCA

Assume the probability density functions $f$ exist, and let $\mathcal{F}_f$ be the space of density functions. Transformation approaches introduce a continuous and invertible transformation $\Psi: \mathcal{F}_f \to \mathbb{H}$, where $\mathbb{H}$ is a Hilbert space of functions. For instance, the log quantile density transformation and the centered log-ratio transformation are popular choices.[5][6]

For $f \in \mathcal{F}_f$, let $Y = \Psi(f)$ be the transformed functional variable. The mean function $\mu_Y(t) = \mathbb{E}[Y(t)]$ and the covariance function $G_Y(s,t) = \operatorname{Cov}(Y(s), Y(t))$ are defined accordingly, and let $\{\lambda_j, \phi_j\}_{j=1}^{\infty}$ be the eigenpairs of $G_Y(s,t)$. The Karhunen–Loève decomposition gives $Y(t) = \mu_Y(t) + \sum_{j=1}^{\infty} \xi_j \phi_j(t)$, where $\xi_j = \int_D [Y(t) - \mu_Y(t)] \phi_j(t)\, dt$. Then, the $j$th transformation mode of variation is defined as[7]
$$g_j^{TF}(t, \alpha) = \Psi^{-1}\left( \mu_Y + \alpha \sqrt{\lambda_j}\, \phi_j \right)(t), \quad t \in D,\; \alpha \in [-A, A].$$
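As an illustrative sketch of one common choice of $\Psi$, the centered log-ratio transformation maps a density to its centered log, and its inverse exponentiates and renormalizes (the grid and toy density below are assumptions):

```python
import numpy as np

# Centered log-ratio transform: y = log f - (1/|D|) * integral_D log f.
def clr(f, t):
    logf = np.log(f)
    return logf - np.trapz(logf, t) / (t[-1] - t[0])

# Inverse: exponentiate and renormalize so the result integrates to 1.
def clr_inverse(y, t):
    g = np.exp(y)
    return g / np.trapz(g, t)

t = np.linspace(0.01, 0.99, 99)
f = 2.0 * t / np.trapz(2.0 * t, t)   # a toy density on (0, 1)
y = clr(f, t)                        # transformed functional variable
f_back = clr_inverse(y, t)           # round trip recovers the density
```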

Log FPCA and Wasserstein Geodesic PCA

Endowed with a metric such as the Wasserstein metric $d_{W_2}$ or the Fisher–Rao metric $d_{FR}$, one can employ the (pseudo-)Riemannian structure of $\mathcal{F}$. Denote the tangent space at the Fréchet mean $\mu$ by $T_\mu$, and define the logarithm and exponential maps $\log_\mu: \mathcal{F} \to T_\mu$ and $\exp_\mu: T_\mu \to \mathcal{F}$. Let $Y = \log_\mu(f)$ be the projection of the density $f$ onto the tangent space.

In Log FPCA, FPCA is performed on $Y$ and the result is mapped back to $\mathcal{F}$ using the exponential map.[8] Therefore, with $Y(t) = \mu_Y(t) + \sum_{j=1}^{\infty} \xi_j \phi_j(t)$, the $j$th Log FPCA mode of variation is defined as
$$g_j^{\mathrm{Log}}(t, \alpha) = \exp_\mu\left( \mu_Y + \alpha \sqrt{\lambda_j}\, \phi_j \right)(t), \quad t \in D,\; \alpha \in [-A, A].$$

As a special case, consider the $L^2$-Wasserstein space $\mathcal{W}_2$, a random distribution $\nu \in \mathcal{W}_2$, and a subset $G \subset \mathcal{W}_2$. Let $d_{W_2}(\nu, G) = \inf_{\mu \in G} d_{W_2}(\nu, \mu)$ and $K_{W_2}(G) = \mathbb{E}[d_{W_2}^2(\nu, G)]$. Let $CL(\mathcal{W}_2)$ be the metric space of nonempty, closed subsets of $\mathcal{W}_2$, endowed with the Hausdorff distance, and define
$$CG_{\nu_0, k}(\mathcal{W}_2) = \left\{ G \in CL(\mathcal{W}_2) : \nu_0 \in G,\; G \text{ is a geodesic set such that } \dim(G) \le k \right\}, \quad k \ge 1.$$
Let the reference measure $\nu_0$ be the Wasserstein mean $\mu_*$. Then, a principal geodesic subspace (PGS) of dimension $k$ with respect to $\mu_*$ is a set $G_k = \operatorname*{argmin}_{G \in CG_{\mu_*, k}(\mathcal{W}_2)} K_{W_2}(G)$.[9][10]

Note that the tangent space $T_{\mu_*}$ is a subspace of $L^2_{\mu_*}$, the Hilbert space of $\mu_*$-square-integrable functions. Obtaining the PGS is equivalent to performing PCA in $L^2_{\mu_*}$ under the constraint that the solution lie in a convex and closed subset.[10] Therefore, a simple approximation of Wasserstein Geodesic PCA is Log FPCA, which relaxes the geodesicity constraint; alternative techniques have also been suggested.[9][10]

Distributional regression

Fréchet regression

Fréchet regression is a generalization of regression to responses taking values in a metric space with Euclidean predictors.[11][12] Using the Wasserstein metric $d_{W_2}$, Fréchet regression models can be applied to distributional objects. The global Wasserstein–Fréchet regression model, which generalizes standard linear regression, is defined as
$$m(x) = \operatorname*{argmin}_{\omega \in \mathcal{W}_2} \mathbb{E}\left[ s_G(X, x)\, d_{W_2}^2(\nu, \omega) \right], \qquad s_G(X, x) = 1 + (X - \mathbb{E}[X])^\top \operatorname{Var}(X)^{-1} (x - \mathbb{E}[X]),$$
where $s_G$ is the global weight function.

For the local Wasserstein–Fréchet regression, consider a scalar predictor $X$ and introduce a smoothing kernel $K_h(\cdot) = h^{-1} K(\cdot / h)$. The local Fréchet regression model, which generalizes the local linear regression model, is defined as
$$l(x) = \operatorname*{argmin}_{\omega \in \mathcal{W}_2} \mathbb{E}\left[ s_L(X, x, h)\, d_{W_2}^2(\nu, \omega) \right], \qquad s_L(X, x, h) = \sigma_0^{-2} \left\{ K_h(X - x) \left[ \mu_2 - \mu_1 (X - x) \right] \right\},$$
where $\mu_j = \mathbb{E}[K_h(X - x)(X - x)^j]$ for $j = 0, 1, 2$, and $\sigma_0^2 = \mu_0 \mu_2 - \mu_1^2$.
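The weight function $s_L$ can be computed empirically; the following sketch uses a Gaussian kernel and simulated predictors (both illustrative assumptions), replacing the expectations in $\mu_j$ with sample means:

```python
import numpy as np

# Empirical local weights s_L(X, x, h) with a Gaussian kernel K_h.
def local_frechet_weights(X, x, h):
    K = np.exp(-((X - x) / h)**2 / 2) / (h * np.sqrt(2 * np.pi))  # K_h(X - x)
    mu0, mu1, mu2 = (np.mean(K * (X - x)**j) for j in (0, 1, 2))  # empirical mu_j
    sigma0_sq = mu0 * mu2 - mu1**2
    return K * (mu2 - mu1 * (X - x)) / sigma0_sq

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, 200)
w = local_frechet_weights(X, 0.5, 0.2)
# By construction the weights average to 1 and are empirically orthogonal to
# (X - x), mirroring the normalization of local linear regression weights.
```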

Transformation based approaches

Consider response variables $\nu$ that are probability distributions. With the space of density functions $\mathcal{F}_f$ and a Hilbert space of functions $\mathbb{H}$, consider continuous and invertible transformations $\Psi: \mathcal{F}_f \to \mathbb{H}$. Examples of transformations include the log hazard transformation, the log quantile density transformation, and the centered log-ratio transformation. Linear methods such as functional linear models are applied to the transformed variables, and the fitted models are interpreted back in the original density space using the inverse transformation.[12]

Random object approaches

In Wasserstein regression, both the predictors $\omega$ and the responses $\nu$ can be distributional objects. Let $\mu_\omega$ and $\mu_\nu$ be the Wasserstein means of $\omega$ and $\nu$, respectively. The Wasserstein regression model is defined as
$$\mathbb{E}\left( \log_{\mu_\nu} \nu \mid \log_{\mu_\omega} \omega \right) = \Gamma\left( \log_{\mu_\omega} \omega \right),$$
with a linear regression operator
$$\Gamma g(t) = \langle \beta(\cdot, t), g \rangle_{\mu_\omega}, \quad t \in D,\; g \in T_{\mu_\omega},\; \beta: D^2 \to \mathbb{R}.$$
Estimation of the regression operator is based on empirical estimators obtained from samples.[13] The Fisher–Rao metric $d_{FR}$ can also be used in a similar fashion.[12][14]
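To make the operator concrete, the following sketch applies a kernel integral operator $\Gamma g(t) = \int \beta(s, t) g(s)\, ds$ on a grid. The kernel $\beta$ is an illustrative choice, not an estimated operator, and the inner product is simplified to the unweighted $L^2$ one rather than the $\mu_\omega$-weighted inner product above:

```python
import numpy as np

s = np.linspace(0.0, 1.0, 101)
ds = s[1] - s[0]
beta = np.outer(np.sin(np.pi * s), np.cos(np.pi * s))  # toy kernel beta(s, t)

# Discretized integral over s, evaluated for each t on the grid.
def gamma(g):
    return beta.T @ g * ds

out = gamma(np.ones_like(s))   # Gamma applied to the constant function 1
# Analytically, Gamma 1(t) = cos(pi t) * integral_0^1 sin(pi s) ds
#                          = (2 / pi) * cos(pi t).
```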

Hypothesis testing

Wasserstein F-test

The Wasserstein F-test has been proposed to test for the effects of the predictors in the Fréchet regression framework with the Wasserstein metric.[15] Consider Euclidean predictors $X \in \mathbb{R}^p$ and distributional responses $\nu \in \mathcal{W}_2$. Denote the Wasserstein mean of $\nu$ by $\mu_*$ and the sample Wasserstein mean by $\hat{\mu}_*$. Consider the global Wasserstein–Fréchet regression model $m(x)$ defined above, which is the conditional Wasserstein mean given $X = x$. The estimator $\hat{m}(x)$ of $m(x)$ is obtained by minimizing the empirical version of the criterion.

Let $F$, $Q$, and $f$ denote the cumulative distribution, quantile, and density functions of $\nu$; let $F_*$, $Q_*$, and $f_*$ denote those of $\mu_*$; and let $F(\cdot\,; x)$, $Q(\cdot\,; x)$, and $f(\cdot\,; x)$ denote those of $m(x)$. For a pair $(X, \nu)$, define $T = Q \circ F(\cdot\,; X)$, the optimal transport map from $m(X)$ to $\nu$. Also, define $S = Q(\cdot\,; X) \circ F_*$, the optimal transport map from $\mu_*$ to $m(X)$. Finally, define the covariance kernel $K(u, v) = \mathbb{E}\left[ \operatorname{Cov}\left( (T - S)(u), (T - S)(v) \right) \right]$, which admits the Mercer decomposition $K(u, v) = \sum_{j=1}^{\infty} \lambda_j \phi_j(u) \phi_j(v)$.

If there are no regression effects, the conditional Wasserstein mean equals the Wasserstein mean. That is, the hypotheses for the test of no effects are
$$H_0: m(x) \equiv \mu_* \quad \text{vs.} \quad H_1: \text{not } H_0.$$
For these hypotheses, the proposed global Wasserstein F-statistic and its asymptotic distribution are
$$F_G = \sum_{i=1}^{n} d_{W_2}^2\left( \hat{m}(X_i), \hat{\mu}_* \right), \qquad F_G \mid X_1, \ldots, X_n \xrightarrow{d} \sum_{j=1}^{\infty} \lambda_j V_j \quad \text{a.s.},$$
where $V_j \overset{\mathrm{iid}}{\sim} \chi_p^2$.[15] An extension to hypothesis testing for partial regression effects, as well as alternative testing approximations using Satterthwaite's approximation or a bootstrap approach, have been proposed.[15]
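A schematic sketch of the statistic itself (the fitted and mean quantile functions below are illustrative stand-ins, not output of an actual Fréchet regression fit): $F_G$ sums squared $L^2$-Wasserstein distances between fitted conditional means and the sample Wasserstein mean, all represented by quantile functions on a grid.

```python
import numpy as np

# Global Wasserstein F-statistic from fitted quantile curves Q_fitted (one row
# per observation) and the sample Wasserstein mean quantile curve Q_mean.
def wasserstein_f_statistic(Q_fitted, Q_mean, s):
    return sum(np.trapz((Qf - Q_mean)**2, s) for Qf in Q_fitted)

s = np.linspace(0.0, 1.0, 101)
# Under H0 (no effect) the fitted quantiles coincide with the mean quantile:
F_null = wasserstein_f_statistic(np.vstack([s, s]), s, s)             # 0
# With an effect, the fits deviate from the mean and the statistic grows:
F_alt = wasserstein_f_statistic(np.vstack([s, 2.0 + s]), 1.0 + s, s)  # 2
```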

Tests for the intrinsic mean

The Hilbert sphere $\mathcal{S}$ is defined as $\mathcal{S} = \{ f \in \mathbb{H} : \| f \| = 1 \}$, where $\mathbb{H}$ is a separable infinite-dimensional Hilbert space with inner product $\langle \cdot, \cdot \rangle$ and norm $\| \cdot \|$. Consider the space of square-root densities $\mathcal{X} = \{ x: D \to \mathbb{R} : x = \sqrt{f},\; \int_D f(t)\, dt = 1 \}$. Then, with the Fisher–Rao metric $d_{FR}$ on $\mathcal{F}_f$, $\mathcal{X}$ is the positive orthant of the Hilbert sphere $\mathcal{S}$ with $\mathbb{H} = L^2(D)$.
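A quick numerical check of this fact (the grid and density are illustrative): the square-root of any density has unit $L^2$ norm, since $\|\sqrt{f}\|^2 = \int_D f(t)\, dt = 1$.

```python
import numpy as np

t = np.linspace(-8.0, 8.0, 1601)
f = np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)  # standard normal density
x = np.sqrt(f)                               # square-root density
norm_sq = np.trapz(x**2, t)                  # close to 1: x lies on the sphere
```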

Let a chart $\tau: U \subset \mathcal{S} \to \mathbb{G}$ be a smooth homeomorphism that maps $U$ onto an open subset $\tau(U)$ of a separable Hilbert space $\mathbb{G}$ of coordinates. For example, $\tau$ can be the logarithm map.[14]

Consider a random element $x = \sqrt{f} \in \mathcal{X}$ equipped with the Fisher–Rao metric, and write its Fréchet mean as $\mu$. Let $\hat{\mu}$ be the empirical estimator of $\mu$ from $n$ samples. Then a central limit theorem for $\hat{\mu}_\tau = \tau(\hat{\mu})$ and $\mu_\tau = \tau(\mu)$ holds:
$$\sqrt{n} \left( \hat{\mu}_\tau - \mu_\tau \right) \xrightarrow{L} Z, \quad n \to \infty,$$
where $Z$ is a Gaussian random element in $\mathbb{G}$ with mean 0 and covariance operator $\mathcal{T}$. Let $(\lambda_k, \phi_k)_{k=1}^{\infty}$ and $(\hat{\lambda}_k, \hat{\phi}_k)_{k=1}^{\infty}$ denote the eigenvalue–eigenfunction pairs of $\mathcal{T}$ and of the estimated covariance operator $\hat{\mathcal{T}}$, respectively.

Consider the one-sample hypothesis test
$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_1: \mu \neq \mu_0,$$
with $\mu_0 \in \mathcal{S}$. Denote by $\| \cdot \|_{\mathbb{G}}$ and $\langle \cdot, \cdot \rangle_{\mathbb{G}}$ the norm and inner product in $\mathbb{G}$. The test statistics and their limiting distributions are
$$T_1 = n \left\| \tau(\hat{\mu}) - \tau(\mu_0) \right\|_{\mathbb{G}}^2 \xrightarrow{L} \sum_{k=1}^{\infty} \lambda_k W_k, \qquad S_1 = n \sum_{k=1}^{K} \frac{ \langle \tau(\hat{\mu}) - \tau(\mu_0), \hat{\phi}_k \rangle_{\mathbb{G}}^2 }{ \hat{\lambda}_k } \xrightarrow{L} \chi_K^2,$$
where $W_k \overset{\mathrm{iid}}{\sim} \chi_1^2$. In practice, the tests can be carried out by employing the limiting distributions with Monte Carlo simulation, or by bootstrap tests. Extensions to the two-sample test and the paired test have also been proposed.[14]

Distributional time series

Autoregressive (AR) models for distributional time series are constructed by defining stationarity and a notion of difference between distributions, using $d_{W_2}$ or $d_{FR}$.

In the Wasserstein autoregressive model (WAR), consider a stationary density time series $f_t$ with Wasserstein mean $f_\oplus$.[16] Denote the difference between $f_t$ and $f_\oplus$ using the logarithm map,
$$f_t \ominus f_\oplus = \log_{f_\oplus} f_t = T_t - \mathrm{id},$$
where $T_t = Q_t \circ F_\oplus$ is the optimal transport map from $f_\oplus$ to $f_t$, in which $F_t$ and $F_\oplus$ are the cdfs of $f_t$ and $f_\oplus$ and $Q_t = F_t^{-1}$. An AR(1) model on the tangent space $T_{f_\oplus}$ is defined as
$$V_t = \beta V_{t-1} + \epsilon_t, \quad t \in \mathbb{Z},$$
for $V_t \in T_{f_\oplus}$, with autoregressive parameter $\beta$ and mean-zero i.i.d. random innovations $\epsilon_t$. Under proper conditions, the measures $\mu_t = \exp_{f_\oplus}(V_t)$ have densities $f_t$, and $V_t = \log_{f_\oplus}(\mu_t)$. Accordingly, WAR(1), with a natural extension to order $p$, is defined as
$$T_t - \mathrm{id} = \beta \left( T_{t-1} - \mathrm{id} \right) + \epsilon_t.$$
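A toy simulation of the WAR(1) recursion on the tangent space (all choices below are illustrative: the innovations are smooth mean-zero bumps vanishing at the endpoints, and a full implementation would additionally need to keep each $T_t$ monotone so it remains a valid transport map):

```python
import numpy as np

rng = np.random.default_rng(2)
u = np.linspace(0.0, 1.0, 101)   # grid on [0, 1] for the transport maps
beta, n_steps = 0.5, 200

V = np.zeros_like(u)             # tangent-space element V_t = T_t - id
transports = []
for _ in range(n_steps):
    eps = 0.05 * rng.normal() * np.sin(2 * np.pi * u)  # mean-zero innovation
    V = beta * V + eps                                  # V_t = beta V_{t-1} + eps_t
    transports.append(u + V)                            # T_t = id + V_t
transports = np.array(transports)
# Each row is a transport map fixing the endpoints 0 and 1.
```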

On the other hand, the spherical autoregressive model (SAR) considers the Fisher–Rao metric.[17] Following the setting of the tests for the intrinsic mean above, let $x_t \in \mathcal{X}$ with Fréchet mean $\mu_x$. Let $\theta_t = \arccos(\langle x_t, \mu_x \rangle)$, the geodesic distance between $x_t$ and $\mu_x$. Define a rotation operator $Q_{x_t, \mu_x}$ that rotates $x_t$ to $\mu_x$. The spherical difference between $x_t$ and $\mu_x$ is represented as $R_t = x_t \ominus \mu_x = \theta_t Q_{x_t, \mu_x}$. Assume that $R_t$ is a stationary sequence with Fréchet mean $\mu_R = \mathbb{E}[R_t]$. Then SAR(1) is defined as
$$R_t - \mu_R = \beta \left( R_{t-1} - \mu_R \right) + \epsilon_t,$$
with autoregressive parameter $\beta$ and mean-zero i.i.d. random innovations $\epsilon_t$. An alternative model, the difference-based spherical autoregressive (DSAR) model, is defined with $R_t = x_{t+1} \ominus x_t$, with natural extensions to order $p$. A similar extension to the Wasserstein space has also been introduced.[18]

References

Template:Reflist