Coupon collector's problem



Graph of the number of coupons n versus the expected number of trials (i.e., time) needed to collect them all, E(T)

In probability theory, the coupon collector's problem refers to the mathematical analysis of "collect all coupons and win" contests. It asks the following question: if each box of a given product (e.g., breakfast cereals) contains a coupon, and there are n different types of coupons, what is the probability that more than t boxes need to be bought to collect all n coupons? An alternative statement is: given n coupons, how many coupons do you expect to need to draw with replacement before having drawn each coupon at least once? The mathematical analysis of the problem reveals that the expected number of trials needed grows as <math>\Theta(n\log(n))</math>. For example, when n = 50 it takes about 225 trials on average to collect all 50 coupons.
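As a quick illustration (a sketch added here, not part of the original analysis), the following Python simulation estimates the expected number of draws empirically; the function name <code>trials_to_collect</code> is our own:

```python
import random

def trials_to_collect(n, rng):
    """Draw coupons uniformly at random until all n types have been seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # each draw is uniform over the n coupon types
        draws += 1
    return draws

rng = random.Random(0)
n, runs = 50, 2000
avg = sum(trials_to_collect(n, rng) for _ in range(runs)) / runs
print(avg)  # empirically close to 225 for n = 50
```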

Solution

Via generating functions

By definition of Stirling numbers of the second kind, the probability that exactly T draws are needed is

<math>\frac{S(T-1, n-1)\, n!}{n^T}.</math>

By manipulating the generating function of the Stirling numbers, we can explicitly calculate all moments of T:

<math>f_k(x) := \sum_{T} S(T, k)\, x^T = \prod_{r=1}^{k} \frac{x}{1 - rx}.</math>

In general, the k-th moment is <math>(n-1)! \left.\left((D_x x)^k f_{n-1}(x)\right)\right|_{x = 1/n}</math>, where <math>D_x</math> is the derivative operator <math>d/dx</math>. For example, the 0th moment is

<math>\sum_T \frac{S(T-1, n-1)\, n!}{n^T} = (n-1)!\, f_{n-1}(1/n) = (n-1)! \times \prod_{r=1}^{n-1} \frac{1/n}{1 - r/n} = 1,</math>

and the 1st moment is <math>(n-1)! \left.\left(D_x x f_{n-1}(x)\right)\right|_{x = 1/n}</math>, which can be explicitly evaluated to <math>n H_n</math>, etc.
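The distribution formula above can be checked numerically. The following sketch (an illustration added here, not from the original) computes S(T − 1, n − 1) · n!/n^T using the standard inclusion–exclusion formula for Stirling numbers of the second kind, then verifies that the probabilities sum to 1 and that the mean matches n · H_n:

```python
from math import comb, factorial

def stirling2(t, k):
    # Stirling number of the second kind S(t, k), via inclusion-exclusion
    return sum((-1) ** (k - j) * comb(k, j) * j ** t for j in range(k + 1)) // factorial(k)

def p_exact(n, t):
    # P(exactly t draws are needed) = S(t-1, n-1) * n! / n^t
    return stirling2(t - 1, n - 1) * factorial(n) / n ** t

n = 4
# The tail decays geometrically, so truncating the sum at t = 400 is ample
total = sum(p_exact(n, t) for t in range(n, 400))
mean = sum(t * p_exact(n, t) for t in range(n, 400))
print(total, mean, n * sum(1 / i for i in range(1, n + 1)))
```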

Calculating the expectation

Let time T be the number of draws needed to collect all n coupons, and let t<sub>i</sub> be the time to collect the i-th coupon after i − 1 coupons have been collected. Then <math>T = t_1 + \cdots + t_n</math>. Think of T and t<sub>i</sub> as random variables. Observe that the probability of collecting a new coupon is <math>p_i = \frac{n - (i - 1)}{n} = \frac{n - i + 1}{n}</math>. Therefore, <math>t_i</math> has a geometric distribution with expectation <math>\frac{1}{p_i} = \frac{n}{n - i + 1}</math>. By the linearity of expectations we have:

<math>\begin{align}
\operatorname{E}(T) & = \operatorname{E}(t_1 + t_2 + \cdots + t_n) = \operatorname{E}(t_1) + \operatorname{E}(t_2) + \cdots + \operatorname{E}(t_n) \\
& = \frac{1}{p_1} + \frac{1}{p_2} + \cdots + \frac{1}{p_n} = \frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{1} \\
& = n \cdot \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n}\right) = n \cdot H_n.
\end{align}</math>

Here Hn is the n-th harmonic number. Using the asymptotics of the harmonic numbers, we obtain:

<math>\operatorname{E}(T) = n \cdot H_n = n \log n + \gamma n + \frac{1}{2} + O(1/n),</math>

where <math>\gamma \approx 0.5772156649</math> is the Euler–Mascheroni constant.
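The exact value n · H_n and its asymptotic approximation can be compared directly; the sketch below (our illustration, with helper names of our choosing) shows how quickly the O(1/n) error vanishes:

```python
from math import log

def expected_draws(n):
    # Exact expectation E(T) = n * H_n
    return n * sum(1 / i for i in range(1, n + 1))

def harmonic_asymptotic(n, gamma=0.5772156649):
    # n log n + gamma * n + 1/2, dropping the O(1/n) term
    return n * log(n) + gamma * n + 0.5

for n in (10, 50, 1000):
    print(n, expected_draws(n), harmonic_asymptotic(n))
```

For n = 50 both values are about 224.96, matching the "about 225 trials" figure in the introduction.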

Using the Markov inequality to bound the desired probability:

<math>\operatorname{P}(T \geq c\, n H_n) \leq \frac{1}{c}.</math>

The above can be modified slightly to handle the case when we've already collected some of the coupons. Let k be the number of coupons already collected, then:

<math>\operatorname{E}(T_k) = \operatorname{E}(t_{k+1} + t_{k+2} + \cdots + t_n) = n \cdot \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n-k}\right) = n \cdot H_{n-k}</math>

When <math>k = 0</math>, we recover the original result.
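The partial-collection formula makes the "last few coupons are the hardest" effect concrete; a small sketch (our illustration, with a helper name of our choosing):

```python
def expected_remaining(n, k):
    # E(T_k) = n * H_{n-k}: expected further draws once k distinct coupons are held
    return n * sum(1 / i for i in range(1, n - k + 1))

n = 50
print(expected_remaining(n, 0))    # the full problem: n * H_n, about 225
print(expected_remaining(n, 25))   # halfway there, yet most of the work remains
print(expected_remaining(n, 49))   # one coupon missing: n more draws on average
```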

Calculating the variance

Using the independence of random variables ti, we obtain:

<math>\begin{align}
\operatorname{Var}(T) & = \operatorname{Var}(t_1 + \cdots + t_n) = \operatorname{Var}(t_1) + \operatorname{Var}(t_2) + \cdots + \operatorname{Var}(t_n) \\
& = \frac{1 - p_1}{p_1^2} + \frac{1 - p_2}{p_2^2} + \cdots + \frac{1 - p_n}{p_n^2} \\
& = \left(\frac{n^2}{n^2} + \frac{n^2}{(n-1)^2} + \cdots + \frac{n^2}{1^2}\right) - \left(\frac{n}{n} + \frac{n}{n-1} + \cdots + \frac{n}{1}\right) \\
& = n^2 \cdot \left(\frac{1}{1^2} + \frac{1}{2^2} + \cdots + \frac{1}{n^2}\right) - n \cdot \left(\frac{1}{1} + \frac{1}{2} + \cdots + \frac{1}{n}\right) \\
& < \frac{\pi^2}{6} n^2
\end{align}</math>

since <math>\frac{\pi^2}{6} = \frac{1}{1^2} + \frac{1}{2^2} + \cdots + \frac{1}{n^2} + \cdots</math> (see Basel problem).

Bound the desired probability using the Chebyshev inequality:

<math>\operatorname{P}\left(|T - n H_n| \geq c\, n\right) \leq \frac{\pi^2}{6 c^2}.</math>
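The variance formula and its Basel bound can be evaluated numerically; the sketch below (our illustration) confirms that the bound dominates and that the standard deviation is only O(n), much smaller than the Θ(n log n) mean:

```python
from math import pi, sqrt

def variance(n):
    # Var(T) = n^2 * sum_{i<=n} 1/i^2  -  n * H_n
    return (n ** 2 * sum(1 / i ** 2 for i in range(1, n + 1))
            - n * sum(1 / i for i in range(1, n + 1)))

n = 50
v = variance(n)
bound = pi ** 2 / 6 * n ** 2
print(v, bound, sqrt(v))  # variance, its upper bound, and the standard deviation
```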

Tail estimates

A stronger estimate for the upper tail can be obtained as follows. Let <math>Z_i^r</math> denote the event that the i-th coupon was not picked in the first r trials. Then

<math>\operatorname{P}\left[Z_i^r\right] = \left(1 - \frac{1}{n}\right)^r \leq e^{-r/n}.</math>

Thus, for <math>r = \beta n \log n</math>, we have <math>\operatorname{P}\left[Z_i^r\right] \leq e^{-(\beta n \log n)/n} = n^{-\beta}</math>. Via a union bound over the <math>n</math> coupons, we obtain

<math>\operatorname{P}\left[T > \beta n \log n\right] = \operatorname{P}\left[\bigcup_i Z_i^{\beta n \log n}\right] \leq n \cdot \operatorname{P}\left[Z_1^{\beta n \log n}\right] \leq n^{-\beta + 1}.</math>
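The union bound can be checked empirically. The following sketch (our illustration, with β = 2 so the bound is n^(−1)) estimates the tail probability by simulation:

```python
import random
from math import log

def trials_to_collect(n, rng):
    # Draw uniformly until all n coupon types have been seen
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(1)
n, beta, runs = 20, 2.0, 5000
threshold = beta * n * log(n)
tail = sum(trials_to_collect(n, rng) > threshold for _ in range(runs)) / runs
print(tail, n ** (1 - beta))  # empirical tail vs. the n^(1-beta) bound
```

The empirical tail comes in below the bound, as expected; the bound is loose but simple.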

Extensions and generalizations

  • Paul Erdős and Alfréd Rényi showed the limit theorem for the distribution of T: <math>\operatorname{P}(T < n\log n + cn) \to e^{-e^{-c}}</math>, as <math>n \to \infty</math>, which is a Gumbel distribution. A simple proof by martingales is in the next section.
  • Donald J. Newman and Lawrence Shepp gave a generalization of the coupon collector's problem when m copies of each coupon need to be collected. Let <math>T_m</math> be the first time m copies of each coupon are collected. They showed that the expectation in this case satisfies:
<math>\operatorname{E}(T_m) = n \log n + (m-1)\, n \log\log n + O(n), \text{ as } n \to \infty.</math>
Here m is fixed. When m = 1 we get the earlier formula for the expectation.
  • A common generalization, also due to Erdős and Rényi:
<math>\operatorname{P}\left(T_m < n\log n + (m-1)\, n \log\log n + cn\right) \to e^{-e^{-c}/(m-1)!}, \text{ as } n \to \infty.</math>
  • In the general case of a nonuniform probability distribution, according to Philippe Flajolet et al.[2]
<math>\operatorname{E}(T) = \int_0^\infty \left(1 - \prod_{i=1}^m \left(1 - e^{-p_i t}\right)\right) dt.</math>
This is equal to
<math>\operatorname{E}(T) = \sum_{q=0}^{m-1} (-1)^{m-1-q} \sum_{|J| = q} \frac{1}{1 - P_J},</math>
where m denotes the number of coupons to be collected and <math>P_J</math> denotes the probability of getting any coupon in the set of coupons J.
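The finite sum above is directly computable for small m. The following sketch (our illustration, with a helper name of our choosing) implements it and checks that uniform probabilities recover n · H_n:

```python
from itertools import combinations

def expected_draws_nonuniform(p):
    """E(T) = sum_{q=0}^{m-1} (-1)^{m-1-q} sum_{|J|=q} 1 / (1 - P_J),
    where P_J is the total probability of the coupons in subset J."""
    m = len(p)
    total = 0.0
    for q in range(m):
        sign = (-1) ** (m - 1 - q)
        for J in combinations(range(m), q):
            total += sign / (1 - sum(p[i] for i in J))
    return total

# Uniform probabilities recover the classical answer n * H_n
n = 6
print(expected_draws_nonuniform([1 / n] * n), n * sum(1 / i for i in range(1, n + 1)))

# A rare coupon inflates the expected collection time
print(expected_draws_nonuniform([0.5, 0.3, 0.1, 0.1]))
```

The loop over subsets is exponential in m, so this is only practical for small coupon sets; the integral representation is better suited to numerical evaluation for large m.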
