Uniform convergence in probability


Uniform convergence in probability is a form of convergence in probability in statistical asymptotic theory and probability theory. It means that, under certain conditions, the empirical frequencies of all events in a certain event-family converge to their theoretical probabilities. Uniform convergence in probability has applications to statistics as well as machine learning as part of statistical learning theory.

The law of large numbers says that, for each single event $A$, its empirical frequency in a sequence of independent trials converges (with high probability) to its theoretical probability. In many applications, however, the need arises to judge simultaneously the probabilities of events of an entire class $S$ from one and the same sample. Moreover, it is required that the relative frequency of the events converge to the probability uniformly over the entire class of events $S$.[1] The Uniform Convergence Theorem gives a sufficient condition for this convergence to hold. Roughly, if the event-family is sufficiently simple (its VC dimension is sufficiently small) then uniform convergence holds.


Definitions

For a class of predicates $H$ defined on a set $X$ and a set of samples $x = (x_1, x_2, \dots, x_m)$, where $x_i \in X$, the empirical frequency of $h \in H$ on $x$ is

$\widehat{Q}_x(h) = \frac{1}{m}\left|\{i : 1 \le i \le m,\ h(x_i) = 1\}\right|.$

The theoretical probability of $h \in H$ is defined as $Q_P(h) = P\{y \in X : h(y) = 1\}$.

The Uniform Convergence Theorem states, roughly, that if H is "simple" and we draw samples independently (with replacement) from X according to any distribution P, then with high probability, the empirical frequency will be close to its expected value, which is the theoretical probability.[2]

Here "simple" means that the Vapnik–Chervonenkis dimension of the class H is small relative to the size of the sample. In other words, a sufficiently simple collection of functions behaves roughly the same on a small random sample as it does on the distribution as a whole.

The Uniform Convergence Theorem was first proved by Vapnik and Chervonenkis[1] using the concept of growth function.

Uniform convergence theorem

The statement of the uniform convergence theorem is as follows:[3]

If $H$ is a set of $\{0,1\}$-valued functions defined on a set $X$ and $P$ is a probability distribution on $X$, then for $\varepsilon > 0$ and $m$ a positive integer, we have:

$P^m\{|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\} \le 4\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8},$

where, for any $x \in X^m$,

$Q_P(h) = P\{y \in X : h(y) = 1\},$

$\widehat{Q}_x(h) = \frac{1}{m}\left|\{i : 1 \le i \le m,\ h(x_i) = 1\}\right|,$

and $|x| = m$. $P^m$ indicates that the probability is taken over $x$ consisting of $m$ i.i.d. draws from the distribution $P$.

$\Pi_H$ is defined as follows: for any set of $\{0,1\}$-valued functions $H$ over $X$ and any $D \subseteq X$,

$\Pi_H(D) = \{h|_D : h \in H\}.$

For any natural number $m$, the shattering number $\Pi_H(m)$ is defined as:

$\Pi_H(m) = \max\{|\Pi_H(D)| : D \subseteq X,\ |D| = m\}.$
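As a hypothetical illustration (not from the original text), the shattering number of the one-dimensional threshold class $h_t(x) = 1$ iff $x \le t$ can be computed by brute force over small samples; for this class $\Pi_H(m) = m + 1$, consistent with its VC dimension being $1$.

```python
import itertools

def restrictions(points, thresholds):
    # Pi_H(D): the set of distinct behaviours of the threshold predicates on the points of D
    return {tuple(1 if x <= t else 0 for x in points) for t in thresholds}

def growth_function(m, thresholds, candidates):
    # Pi_H(m): maximize |Pi_H(D)| over m-point subsets D drawn from a finite candidate pool
    return max(len(restrictions(D, thresholds))
               for D in itertools.combinations(candidates, m))

if __name__ == "__main__":
    thresholds = [i / 100 for i in range(101)]       # dense grid of thresholds
    candidates = [(i + 1) / 13 for i in range(12)]   # well-separated sample points in (0, 1)
    for m in range(1, 6):
        print(m, growth_function(m, thresholds, candidates))  # prints m + 1 for each m
```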

From the point of view of learning theory, one can consider $H$ to be the concept/hypothesis class defined over the instance set $X$. Before getting into the details of the proof of the theorem, we state the Sauer–Shelah lemma, which we will need in our proof.

Sauer–Shelah lemma

The Sauer–Shelah lemma[4] relates the shattering number $\Pi_H(m)$ to the VC dimension.

Lemma: $\Pi_H(m) \le \left(\frac{em}{d}\right)^d$, where $d$ is the VC dimension of the concept class $H$.

Corollary: $\Pi_H(m) \le m^d$.
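For instance, substituting the corollary (applied to a sample of size $2m$) into the bound of the uniform convergence theorem gives, for a concept class of VC dimension $d$,

$P^m\{|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\} \le 4\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8} \le 4\,(2m)^d\,e^{-\varepsilon^2 m/8},$

which tends to $0$ as $m \to \infty$ for any fixed $d$ and $\varepsilon > 0$, since the polynomial factor is eventually dominated by the exponential decay.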

Proof of uniform convergence theorem

The proof below follows [1] and [3]. Before getting into the details of the proof of the Uniform Convergence Theorem, we present a high-level overview.

  1. Symmetrization: We transform the problem of analyzing $|Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon$ into the problem of analyzing $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$, where $r$ and $s$ are i.i.d. samples of size $m$ drawn according to the distribution $P$. One can view $r$ as the original randomly drawn sample of length $m$, while $s$ may be thought of as the testing sample which is used to estimate $Q_P(h)$.
  2. Permutation: Since $r$ and $s$ are picked identically and independently, swapping elements between them does not change the probability distribution on $r$ and $s$. So we try to bound the probability of $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$ for some $h \in H$ by considering the effect of a specific collection of permutations of the joint sample $x = r||s$, where $r||s$ denotes the concatenation of $r$ and $s$. Specifically, we consider permutations $\sigma(x)$ which swap $x_i$ and $x_{m+i}$ for $i$ in some subset of $\{1, 2, \dots, m\}$.
  3. Reduction to a finite class: We can now restrict the function class $H$ to a fixed joint sample and hence, if $H$ has finite VC dimension, the problem reduces to one involving a finite function class.

We present the technical details of the proof.

Symmetrization

Lemma: Let $V = \{x \in X^m : |Q_P(h) - \widehat{Q}_x(h)| \ge \varepsilon \text{ for some } h \in H\}$ and

$R = \{(r, s) \in X^m \times X^m : |\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2 \text{ for some } h \in H\}.$

Then for $m \ge \frac{2}{\varepsilon^2}$, $P^m(V) \le 2P^{2m}(R)$.

Proof: By the triangle inequality,
if $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$ and $|Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2$, then $|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge \varepsilon/2$.
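Spelled out, the implication is the reverse triangle inequality:

$|\widehat{Q}_r(h) - \widehat{Q}_s(h)| \ge |Q_P(h) - \widehat{Q}_r(h)| - |Q_P(h) - \widehat{Q}_s(h)| \ge \varepsilon - \varepsilon/2 = \varepsilon/2.$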

Therefore,

$P^{2m}(R) \ge P^{2m}\{\exists h \in H,\ |Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2\}$
$= \int_V P^m\{s : \exists h \in H,\ |Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon \text{ and } |Q_P(h) - \widehat{Q}_s(h)| \le \varepsilon/2\}\,dP^m(r) = A$

since r and s are independent.

Now for $r \in V$ fix an $h \in H$ such that $|Q_P(h) - \widehat{Q}_r(h)| \ge \varepsilon$. For this $h$, we shall show that

$P^m\left\{|Q_P(h) - \widehat{Q}_s(h)| \le \frac{\varepsilon}{2}\right\} \ge \frac{1}{2}.$

Thus the inner probability is at least $\frac{1}{2}$ for every $r \in V$, so $A \ge \frac{P^m(V)}{2}$, and hence $P^{2m}(R) \ge \frac{P^m(V)}{2}$. This completes the first step of the high-level outline.

Notice that $m\widehat{Q}_s(h)$ is a binomial random variable with expectation $mQ_P(h)$ and variance $mQ_P(h)(1 - Q_P(h))$. By Chebyshev's inequality we get

$P^m\left\{|Q_P(h) - \widehat{Q}_s(h)| > \frac{\varepsilon}{2}\right\} \le \frac{mQ_P(h)(1 - Q_P(h))}{(\varepsilon m/2)^2} \le \frac{1}{\varepsilon^2 m} \le \frac{1}{2}$

for the mentioned bound on $m$. Here we use the fact that $x(1 - x) \le 1/4$ for all $x$.
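Written out, the middle step uses $Q_P(h)(1 - Q_P(h)) \le 1/4$ and then the assumption $m \ge 2/\varepsilon^2$:

$\frac{mQ_P(h)(1 - Q_P(h))}{(\varepsilon m/2)^2} \le \frac{m/4}{\varepsilon^2 m^2/4} = \frac{1}{\varepsilon^2 m} \le \frac{1}{2}.$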

Permutations

Let $\Gamma_m$ be the set of all permutations of $\{1, 2, 3, \dots, 2m\}$ that swap $i$ and $m + i$ for $i$ in some subset of $\{1, 2, \dots, m\}$.
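As a side illustration (a hypothetical helper, not part of the article), a uniformly random element of $\Gamma_m$ can be sampled by deciding independently, for each index, whether to swap it with its partner:

```python
import random

def random_swap_permutation(m):
    # Uniform draw from Gamma_m, using 0-based indices 0..2m-1:
    # each i in {0, ..., m-1} is swapped with m + i independently with probability 1/2.
    sigma = list(range(2 * m))
    for i in range(m):
        if random.random() < 0.5:
            sigma[i], sigma[m + i] = sigma[m + i], sigma[i]
    return sigma

def apply_permutation(sigma, x):
    # Apply sigma to a joint sample x = r || s of length 2m.
    return [x[sigma[i]] for i in range(len(x))]
```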

Lemma: Let $R$ be any subset of $X^{2m}$ and $P$ any probability distribution on $X$. Then,

$P^{2m}(R) = E[\Pr[\sigma(x) \in R]] \le \max_{x \in X^{2m}} \Pr[\sigma(x) \in R],$

where the expectation is over $x$ chosen according to $P^{2m}$, and the probability is over $\sigma$ chosen uniformly from $\Gamma_m$.

Proof: For any $\sigma \in \Gamma_m$,

$P^{2m}(R) = P^{2m}\{x : \sigma(x) \in R\}$

(since coordinate permutations preserve the product distribution $P^{2m}$). Therefore,

$P^{2m}(R) = \int_{X^{2m}} 1_R(x)\,dP^{2m}(x)$
$= \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m}\int_{X^{2m}} 1_R(\sigma(x))\,dP^{2m}(x)$
$= \int_{X^{2m}} \frac{1}{|\Gamma_m|}\sum_{\sigma \in \Gamma_m} 1_R(\sigma(x))\,dP^{2m}(x)$ (because $|\Gamma_m|$ is finite)
$= \int_{X^{2m}} \Pr[\sigma(x) \in R]\,dP^{2m}(x)$ (the expectation)
$\le \max_{x \in X^{2m}} \Pr[\sigma(x) \in R].$

The maximum is guaranteed to exist since there is only a finite set of values that probability under a random permutation can take.

Reduction to a finite class

Lemma: Using the previous lemma,

$\max_{x \in X^{2m}} \Pr[\sigma(x) \in R] \le 2\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8}.$

Proof: Define $x = (x_1, x_2, \dots, x_{2m})$ and let $t = |H|_x|$, which is at most $\Pi_H(2m)$. This means there are functions $h_1, h_2, \dots, h_t \in H$ such that for any $h \in H$ there is an $i$ between $1$ and $t$ with $h_i(x_k) = h(x_k)$ for $1 \le k \le 2m$.

We see that $\sigma(x) \in R$ iff some $h$ in $H$ satisfies $\left|\frac{1}{m}\,|\{1 \le i \le m : h(x_{\sigma(i)}) = 1\}| - \frac{1}{m}\,|\{m + 1 \le i \le 2m : h(x_{\sigma(i)}) = 1\}|\right| \ge \frac{\varepsilon}{2}$. Hence, for $1 \le i \le 2m$ and $1 \le j \le t$, define $w_i^j = 1$ if $h_j(x_i) = 1$ and $w_i^j = 0$ otherwise.

Then $\sigma(x) \in R$ iff some $j$ in $\{1, \dots, t\}$ satisfies $\left|\frac{1}{m}\left(\sum_{i=1}^m w_{\sigma(i)}^j - \sum_{i=1}^m w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}$. By the union bound we get

$\Pr[\sigma(x) \in R] \le t \cdot \max_{1 \le j \le t} \Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^m w_{\sigma(i)}^j - \sum_{i=1}^m w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right]$
$\le \Pi_H(2m) \cdot \max_{1 \le j \le t} \Pr\left[\left|\frac{1}{m}\left(\sum_{i=1}^m w_{\sigma(i)}^j - \sum_{i=1}^m w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right].$

Since the distribution over the permutations $\sigma$ is uniform, for each $i$, $w_{\sigma(i)}^j - w_{\sigma(m+i)}^j$ equals $\pm|w_i^j - w_{m+i}^j|$ with equal probability.

Thus,

$\Pr\left[\left|\frac{1}{m}\sum_{i=1}^m\left(w_{\sigma(i)}^j - w_{\sigma(m+i)}^j\right)\right| \ge \frac{\varepsilon}{2}\right] = \Pr\left[\left|\frac{1}{m}\sum_{i=1}^m |w_i^j - w_{m+i}^j|\,\beta_i\right| \ge \frac{\varepsilon}{2}\right],$

where the probability on the right is over the independent random signs $\beta_i \in \{-1, +1\}$, both possibilities being equally likely. By Hoeffding's inequality, this is at most $2e^{-m\varepsilon^2/8}$.
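In more detail, the summands $|w_i^j - w_{m+i}^j|\,\beta_i$ are independent, have mean zero, and take values in $[-1, 1]$, so Hoeffding's inequality with deviation $t = m\varepsilon/2$ gives

$\Pr\left[\left|\sum_{i=1}^m |w_i^j - w_{m+i}^j|\,\beta_i\right| \ge \frac{m\varepsilon}{2}\right] \le 2\exp\left(-\frac{2(m\varepsilon/2)^2}{4m}\right) = 2e^{-m\varepsilon^2/8}.$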

Finally, combining all three parts of the proof, we get the Uniform Convergence Theorem.
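Explicitly, for $m \ge 2/\varepsilon^2$ the three lemmas chain together as

$P^m(V) \le 2\,P^{2m}(R) \le 2\max_{x \in X^{2m}} \Pr[\sigma(x) \in R] \le 2 \cdot 2\,\Pi_H(2m)\,e^{-m\varepsilon^2/8} = 4\,\Pi_H(2m)\,e^{-\varepsilon^2 m/8};$

for $m < 2/\varepsilon^2$ the right-hand side exceeds $1$ (since $4e^{-\varepsilon^2 m/8} > 4e^{-1/4} > 1$), so the stated bound holds trivially.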

References
