Johnson–Lindenstrauss lemma

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.

The lemma has applications in compressed sensing, manifold learning, dimensionality reduction, graph embedding, and natural language processing. Much of the data stored and manipulated on computers, including text and images, can be represented as points in a high-dimensional space (see vector space model for the case of text). However, the essential algorithms for working with such data tend to become bogged down very quickly as dimension increases.[1] It is therefore desirable to reduce the dimensionality of the data in a way that preserves its relevant structure.

Statement

Given $0 < \varepsilon < 1$, a set $X$ of $N$ points in $\mathbb{R}^n$, and an integer $k > 8(\ln N)/\varepsilon^2$,[2] there is a linear map $f : \mathbb{R}^n \to \mathbb{R}^k$ such that

$(1 - \varepsilon)\,\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \varepsilon)\,\|u - v\|^2$

for all $u, v \in X$.

The formula can be rearranged: $(1 + \varepsilon)^{-1}\,\|f(u) - f(v)\|^2 \le \|u - v\|^2 \le (1 - \varepsilon)^{-1}\,\|f(u) - f(v)\|^2$

Alternatively, for any $\varepsilon \in (0, 1)$ and any integer $k \ge 15(\ln N)/\varepsilon^2$,[Note 1] there exists a linear function $f : \mathbb{R}^n \to \mathbb{R}^k$ such that the restriction $f|_X$ is $(1 + \varepsilon)$-bi-Lipschitz.[Note 2]

Also, the lemma is tight up to a constant factor, i.e. there exists a set of points of size N that needs dimension

$\Omega\!\left(\frac{\log N}{\varepsilon^2}\right)$

in order to preserve the distances between all pairs of points within a factor of (1±ε).[3][4]
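The required target dimension therefore depends only on the number of points $N$ and on $\varepsilon$, not on the ambient dimension $n$. A minimal sketch evaluating the bound from the statement above (the helper name jl_target_dimension is illustrative):

```python
import math

def jl_target_dimension(num_points: int, eps: float) -> int:
    """Smallest integer k satisfying k > 8 * ln(N) / eps^2, as in the statement above."""
    return math.floor(8 * math.log(num_points) / eps**2) + 1

# One million points with 10% distortion: about 11,053 dimensions suffice,
# independently of the original dimension n.
print(jl_target_dimension(10**6, 0.1))
```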

The classical proof of the lemma takes $f$ to be a scalar multiple of an orthogonal projection $P$ onto a random subspace of dimension $k$ in $\mathbb{R}^n$. An orthogonal projection collapses some dimensions of the space it is applied to, which reduces the length of every vector, as well as the distances between vectors in the space. Under the conditions of the lemma, concentration of measure ensures there is a nonzero chance that a random orthogonal projection reduces pairwise distances between all points in $X$ by roughly a constant factor $c$. Since this chance is nonzero, such projections must exist, so one can choose one such $P$ and set $f(v) = Pv/c$.

To obtain the projection algorithmically, it suffices to repeatedly sample orthogonal projection matrices at random and test each candidate against the distance bounds: with high probability a suitable projection is found after a number of trials that is polynomial in expectation, giving a randomized polynomial-time algorithm.
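A minimal sketch of this rejection-sampling procedure, assuming NumPy (the helper sample_jl_map and the scaling factor $\sqrt{n/k}$, which corresponds to dividing by the constant $c$ of the lemma, are illustrative choices, not part of the lemma itself):

```python
import numpy as np

def sample_jl_map(X: np.ndarray, k: int, eps: float, rng, max_trials: int = 100):
    """Repeatedly sample a scaled random orthogonal projection until every pairwise
    squared distance among the rows of X is preserved up to a factor (1 +/- eps)."""
    N, n = X.shape
    for _ in range(max_trials):
        # Orthonormal basis of a uniformly random k-dimensional subspace of R^n.
        Q, _ = np.linalg.qr(rng.standard_normal((n, k)))
        Y = np.sqrt(n / k) * X @ Q                    # scaled projection of all points
        ok = True
        for i in range(N):
            for j in range(i + 1, N):
                d_orig = np.sum((X[i] - X[j]) ** 2)
                d_proj = np.sum((Y[i] - Y[j]) ** 2)
                ok &= (1 - eps) * d_orig <= d_proj <= (1 + eps) * d_orig
        if ok:
            return lambda v: np.sqrt(n / k) * v @ Q   # the map f
    raise RuntimeError("no suitable projection found")

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))   # 50 points in R^1000
f = sample_jl_map(X, k=600, eps=0.3, rng=rng)
```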

Proof

The proof below follows lecture notes from the MIT course 18.S096.[5]

Construct a random matrix $A \in \mathbb{R}^{k \times n}$ by sampling each entry independently from the standard normal distribution $\mathcal{N}(0,1)$, and define $P := A/\sqrt{k}$. Then, for any nonzero vector $x \in \mathbb{R}^n$, let the projected vector be $\hat{x} := Px$. A standard geometric argument shows that $r := \|Ax\|^2/\|x\|^2 = k\,\|\hat{x}\|^2/\|x\|^2$ is chi-square distributed with $k$ degrees of freedom, that is, $r \sim \chi^2(k)$. Thus it satisfies the concentration inequality

$\Pr\big(r \in (1 \pm \varepsilon)k\big) \ge 1 - 2e^{-\frac{k}{2}\left(\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3}\right)},$

which is equivalent to $\|\hat{x}\|^2 \in (1 \pm \varepsilon)\|x\|^2$ holding with the same probability. By the union bound, the probability that this relation holds simultaneously for all of $x_1, \ldots, x_N$ is greater than $1 - 2N e^{-\frac{k}{2}\left(\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{3}\right)}$.

When $k > \frac{4\ln(2N)}{\varepsilon^2\left(1 - 2\varepsilon/3\right)}$, the probability is nonzero.

More generally, when $k \ge \frac{4(d+1)\ln(2N)}{\varepsilon^2\left(1 - 2\varepsilon/3\right)}$, the probability is at least $1 - 1/(2N)^d$, allowing an arbitrarily high probability of success per sample, and a fortiori a polynomial randomized running time.
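A short numerical check of the concentration step, assuming NumPy: for a fixed unit vector, the empirical frequency of the event $r \in (1 \pm \varepsilon)k$ is compared against the stated lower bound (all parameter values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, eps, trials = 300, 200, 0.2, 2000

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                       # a fixed unit vector

hits = 0
for _ in range(trials):
    A = rng.standard_normal((k, n))          # fresh k x n standard Gaussian matrix
    r = np.sum((A @ x) ** 2)                 # r = ||Ax||^2, chi-square with k degrees of freedom
    hits += (1 - eps) * k <= r <= (1 + eps) * k

empirical = hits / trials
bound = 1 - 2 * np.exp(-(k / 2) * (eps**2 / 2 - eps**3 / 3))
print(empirical, ">=", bound)                # the empirical frequency should exceed the bound
```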

Alternate statement

A related lemma is the distributional JL lemma. This lemma states that for any $0 < \varepsilon, \delta < 1/2$ and positive integer $d$, there exists a distribution over $\mathbb{R}^{k \times d}$ from which the matrix $A$ is drawn such that for $k = O(\varepsilon^{-2}\log(1/\delta))$ and for any unit-length vector $x \in \mathbb{R}^d$, the claim below holds.[6]

$P\big(\big|\,\|Ax\|_2^2 - 1\,\big| > \varepsilon\big) < \delta$

One can obtain the JL lemma from the distributional version by setting $x = (u - v)/\|u - v\|_2$ and $\delta < 1/N^2$ for each pair $u, v \in X$. Then the JL lemma follows by a union bound over all such pairs.
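A sketch of this reduction, assuming NumPy and using a scaled Gaussian matrix as the draw from the distribution (the constants below are illustrative choices, with $k$ taken comfortably above the $O(\varepsilon^{-2}\log(1/\delta))$ requirement for $\delta \approx 1/N^2$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, eps, k = 50, 1000, 0.4, 450             # illustrative parameters
X = rng.standard_normal((N, n))               # N points in R^n

A = rng.standard_normal((k, n)) / np.sqrt(k)  # one draw of the random matrix

# Check the distributional guarantee on every normalized difference vector.
worst = 0.0
for i in range(N):
    for j in range(i + 1, N):
        x = (X[i] - X[j]) / np.linalg.norm(X[i] - X[j])
        worst = max(worst, abs(np.sum((A @ x) ** 2) - 1.0))
print(worst)                                  # with high probability this is below eps
```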

Sparse JL transform

Database-friendly JL transform

Achlioptas (2003)[7] proposed a "database-friendly" JL transform, using matrices with entries drawn only from {−1, 0, +1}.
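A minimal sketch of one such matrix, assuming NumPy and using the sparse variant commonly attributed to Achlioptas (2003), in which two thirds of the entries are zero (the paper also analyzes a dense ±1 variant; the helper name is illustrative):

```python
import numpy as np

def achlioptas_matrix(k: int, d: int, rng) -> np.ndarray:
    """Sparse sign matrix: each entry is +1 or -1 with probability 1/6 and 0 with
    probability 2/3, scaled by sqrt(3/k) so squared norms are preserved in expectation."""
    entries = rng.choice([-1.0, 0.0, 1.0], size=(k, d), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0 / k) * entries

rng = np.random.default_rng(0)
R = achlioptas_matrix(k=300, d=2000, rng=rng)
v = rng.standard_normal(2000)
print(np.sum((R @ v) ** 2) / np.sum(v ** 2))   # close to 1
```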

Fix some unit vector $v \in \mathbb{R}^n$, and let $R$ be the $k \times n$ random matrix of the transform, so that $f(v) = Rv/\sqrt{k}$. Define $Q_i := \sum_j R_{ij} v_j$; then $\|f(v)\|_2^2 = \frac{1}{k}\sum_i Q_i^2$.

Now, since $Q_1, \ldots, Q_k$ are i.i.d., we want to apply a Chernoff concentration bound for $\frac{1}{k}\sum_i Q_i^2$ around $1$. This requires upper-bounding the cumulant generating function (CGF).

Since $Q_i$ can be shown to be stochastically dominated by the standard Gaussian, and $\mathbb{E}[Q_i^2] = 1$, it remains to perform a Chernoff bound for $Q_i^2$, which requires bounding the cumulant generating function on both ends.

Sparser JL transform on well-spread vectors

Matoušek (2008)[8] proposed a variant of the above JL transform that is even sparser, though it only works on "well-spread" vectors.

The above cases are generalized to matrices with independent, mean-zero, unit-variance, subgaussian entries in (Dirksen, 2016).[9]

Speeding up the JL transform

Given $A$, computing the matrix–vector product $Ax$ takes $O(kd)$ time. There has been some work on deriving distributions for which the matrix–vector product can be computed in less than $O(kd)$ time.

There are two major lines of work. The first, the Fast Johnson–Lindenstrauss Transform (FJLT),[10] was introduced by Ailon and Chazelle in 2006. This method allows the computation of the matrix–vector product in just $O(d\log d + k^{2+\gamma})$ time for any constant $\gamma > 0$.
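A simplified sketch in the spirit of this approach, assuming NumPy: a random sign flip followed by a fast Walsh–Hadamard transform and uniform coordinate sampling (a subsampled randomized Hadamard transform; the FJLT of Ailon and Chazelle composes the randomized Hadamard step with a sparse random projection rather than plain coordinate sampling, so the helpers below are illustrative):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform in O(d log d); len(x) must be a power of two."""
    x = x.copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x / np.sqrt(len(x))

def srht(x: np.ndarray, k: int, rng) -> np.ndarray:
    """Random sign flip, fast Hadamard transform, then sample k coordinates (rescaled)."""
    d = len(x)
    signs = rng.choice([-1.0, 1.0], size=d)        # random diagonal of +/- 1
    y = fwht(signs * x)                            # H D x, computed in O(d log d)
    idx = rng.choice(d, size=k, replace=False)     # uniform coordinate sample
    return np.sqrt(d / k) * y[idx]

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
print(np.sum(srht(x, 256, rng) ** 2) / np.sum(x ** 2))   # close to 1
```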

Another approach is to build a distribution supported over matrices that are sparse.[11] This method allows keeping only an $\varepsilon$ fraction of the entries in the matrix, which means the computation can be done in just $O(kd\varepsilon)$ time. Furthermore, if the vector has only $b$ non-zero entries, the sparse JL transform takes time $O(kb\varepsilon)$, which may be much less than the $O(d\log d)$ time used by the Fast JL transform.
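A sketch of one standard sparse construction, assuming NumPy and SciPy: each column receives $s$ nonzero entries of value $\pm 1/\sqrt{s}$ in uniformly random rows, so only a small fraction of the entries are stored (the sparsity parameter $s$ and the helper name are illustrative):

```python
import numpy as np
from scipy.sparse import csc_matrix

def sparse_jl_matrix(k: int, d: int, s: int, rng) -> csc_matrix:
    """k x d matrix with s nonzeros of value +/- 1/sqrt(s) per column, in random rows."""
    rows = np.concatenate([rng.choice(k, size=s, replace=False) for _ in range(d)])
    cols = np.repeat(np.arange(d), s)
    vals = rng.choice([-1.0, 1.0], size=s * d) / np.sqrt(s)
    return csc_matrix((vals, (rows, cols)), shape=(k, d))

rng = np.random.default_rng(0)
S = sparse_jl_matrix(k=256, d=10_000, s=8, rng=rng)
x = rng.standard_normal(10_000)
print(np.sum((S @ x) ** 2) / np.sum(x ** 2))   # close to 1
```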

Tensorized random projections

It is possible to combine two JL matrices by taking the so-called face-splitting product, defined as the tensor product of the rows (proposed by V. Slyusar[12] in 1996[13][14][15][16][17] for radar and digital antenna array applications). More directly, let $C \in \mathbb{R}^{3 \times 3}$ and $D \in \mathbb{R}^{3 \times 3}$ be two matrices. Then the face-splitting product $C \bullet D$ is[13][14][15][16][17]

$C \bullet D = \begin{bmatrix} C_1 \otimes D_1 \\ C_2 \otimes D_2 \\ C_3 \otimes D_3 \end{bmatrix}.$

This idea of tensorization was used by Kasiviswanathan et al. for differential privacy.[18]

JL matrices defined like this use fewer random bits, and can be applied quickly to vectors that have tensor structure, due to the following identity:[15]

$(C \bullet D)(x \otimes y) = Cx \circ Dy = \begin{bmatrix} (Cx)_1 (Dy)_1 \\ (Cx)_2 (Dy)_2 \\ \vdots \end{bmatrix},$

where $\circ$ is the element-wise (Hadamard) product. Such computations have been used to efficiently compute polynomial kernels and many other linear algebra algorithms.[19]
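A quick NumPy check of this identity for small matrices (the variable names follow the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
C, D = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
x, y = rng.standard_normal(3), rng.standard_normal(3)

# Face-splitting product: row-wise Kronecker (tensor) products of C and D.
face_split = np.vstack([np.kron(C[i], D[i]) for i in range(3)])

lhs = face_split @ np.kron(x, y)   # (C . D)(x tensor y), applied in the 9-dimensional space
rhs = (C @ x) * (D @ y)            # C x, Hadamard-multiplied with D y
print(np.allclose(lhs, rhs))       # True
```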

In 2020[20] it was shown that if the matrices $C_1, C_2, \ldots, C_c$ are independent $\pm 1$ or Gaussian matrices, the combined matrix $C_1 \bullet \cdots \bullet C_c$ satisfies the distributional JL lemma if the number of rows is at least

$O\!\left(\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}\left(\tfrac{1}{c}\log(1/\delta)\right)^{c}\right).$

For large $\varepsilon$ this is as good as the completely random Johnson–Lindenstrauss transform, but a matching lower bound in the same paper shows that this exponential dependence on $(\log(1/\delta))^c$ is necessary. Alternative JL constructions are suggested to circumvent this.

See also

Notes

References

  1. For instance, writing about nearest neighbor search in high-dimensional data sets, Jon Kleinberg writes: "The more sophisticated algorithms typically achieve a query time that is logarithmic in n at the expense of an exponential dependence on the dimension d; indeed, even the average case analysis of heuristics such as k-d trees reveal an exponential dependence on d in the query time." Template:Citation.
  2. Template:Cite web
  3. Template:Citation
  4. Template:Citation
  5. MIT 18.S096 (Fall 2015): Topics in Mathematics of Data Science, Lecture 5, Johnson–Lindenstrauss Lemma and Gordon's Theorem
  6. Template:Citation
  7. Template:Cite journal
  8. Template:Cite journal
  9. Template:Cite journal
  10. Template:Citation
  11. Template:Citation. A preliminary version of this paper was published in the Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, 2012.
  12. Template:Citation
  13. Template:Citation
  14. Template:Citation
  15. Template:Citation
  16. Template:Citation
  17. Template:Citation
  18. Template:Citation
  19. Template:Citation
  20. Template:Citation

