Łojasiewicz inequality


In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let $f : U \to \mathbb{R}$ be a real analytic function on an open set $U$ in $\mathbb{R}^n$, and let $Z$ be the zero locus of $f$. Assume that $Z$ is not empty. Then for any compact set $K$ in $U$, there exist positive constants $\alpha$ and $C$ such that, for all $x$ in $K$,

$$\operatorname{dist}(x, Z)^{\alpha} \le C\,|f(x)|.$$

Here α can be large.
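
For instance, if $f(x) = x^m$ on $\mathbb{R}$ with $m$ a positive even integer, then $Z = \{0\}$ and $\operatorname{dist}(x, Z) = |x|$, so the inequality holds on any compact $K$ with $C = 1$ and $\alpha = m$, but with no exponent smaller than $m$; taking $m$ large shows that $\alpha$ cannot be bounded independently of $f$.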

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on $f$, for every $p \in U$ there is a possibly smaller open neighborhood $W$ of $p$ and constants $\theta \in (0,1)$ and $c > 0$ such that, for all $x \in W$,

$$|f(x) - f(p)|^{\theta} \le c\,\|\nabla f(x)\|.$$
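
For instance, if $f(x) = x^2$ and $p = 0$, then $|f(x) - f(p)|^{1/2} = |x|$ and $|\nabla f(x)| = 2|x|$, so the inequality holds on all of $\mathbb{R}$ with $\theta = 1/2$ and $c = 1/2$.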

Polyak inequality

A special case of the Łojasiewicz inequality, due to Boris Polyak, is commonly used to prove linear convergence of gradient descent algorithms.

Definitions

$f : \mathbb{R}^d \to \mathbb{R}$ is a differentiable function with a continuous gradient $\nabla f$.

$X^*$ is the subset of $\mathbb{R}^d$ on which $f$ achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value $f^*$ exists, unless otherwise stated. The optimization objective is to find some point $x^*$ in $X^*$.

$\mu, L > 0$ are constants.

$\nabla f$ is $L$-Lipschitz continuous iff
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \forall x, y.$$

$f$ is $\mu$-strongly convex iff
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|y - x\|^2 \quad \forall x, y.$$

$f$ is $\mu$-PL (where "PL" means "Polyak–Łojasiewicz") iff
$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f(x^*)\bigr) \quad \forall x.$$
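
For a concrete check of these definitions, consider the least-squares objective $f(x) = \tfrac{1}{2}\|Ax - b\|^2$: its gradient is $L$-Lipschitz with $L = \lambda_{\max}(A^T A)$, and it is $\mu$-PL with $\mu$ the smallest nonzero eigenvalue of $A^T A$, even when $A^T A$ is singular and $f$ is therefore not strongly convex. The following sketch (an illustrative example, not from the source; NumPy assumed) verifies the PL inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 5                        # n < d: A^T A is singular, so f is not strongly convex
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L = eigs[-1]                                  # gradient is L-Lipschitz with this L
mu = min(e for e in eigs if e > 1e-9)         # smallest nonzero eigenvalue: a PL constant
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])   # global minimum value

# check  (1/2) ||grad f(x)||^2  >=  mu (f(x) - f*)  at random points
for _ in range(1000):
    x = 10 * rng.normal(size=d)
    assert 0.5 * grad(x) @ grad(x) >= mu * (f(x) - f_star) - 1e-8
print("mu-PL inequality verified at 1000 sampled points")
```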

Basic properties

If $f$ is $\mu$-strongly convex, then it is $\mu$-PL. Indeed, minimizing both sides of the strong-convexity inequality over $y$ gives
$$f^* \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,$$
which rearranges to the PL inequality. The converse fails: a $\mu$-PL function need not be convex. However, every stationary point of a $\mu$-PL function is a global minimizer, since $\nabla f(x) = 0$ forces $f(x) \le f^*$.

Gradient descent

Theorem (linear convergence of gradient descent). Suppose $\nabla f$ is $L$-Lipschitz continuous and $f$ is $\mu$-PL, and run gradient descent with step size $1/L$:
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k).$$
Then
$$f(x_k) - f^* \le (1 - \mu/L)^k\,[f(x_0) - f^*],$$
so the optimality gap shrinks by a constant factor at every step.

Proof sketch. $L$-smoothness gives the descent inequality $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$; bounding $\|\nabla f(x_k)\|^2 \ge 2\mu\,(f(x_k) - f^*)$ by the PL inequality and subtracting $f^*$ from both sides yields $f(x_{k+1}) - f^* \le (1 - \mu/L)\,[f(x_k) - f^*]$, and induction finishes the proof.
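
The rate can be checked numerically. The sketch below (an illustrative test problem, not from the source) runs gradient descent on a rank-deficient least-squares objective, which is PL but not strongly convex, and asserts the bound at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))    # rank-deficient least squares: PL but not strongly convex
b = rng.normal(size=3)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L = eigs[-1]                                  # smoothness constant
mu = min(e for e in eigs if e > 1e-9)         # PL constant (smallest nonzero eigenvalue)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = rng.normal(size=5)
gap0 = f(x) - f_star
for k in range(1, 51):
    x = x - grad(x) / L                       # gradient descent with step size 1/L
    # theorem:  f(x_k) - f*  <=  (1 - mu/L)^k (f(x_0) - f*)
    assert f(x) - f_star <= (1 - mu / L) ** k * gap0 + 1e-9
print("linear convergence bound holds for 50 iterations")
```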

Coordinate descent

The coordinate descent algorithm first samples a coordinate $i_k$ uniformly at random, then performs a gradient descent step along that coordinate:
$$x_{k+1} = x_k - \eta\,\partial_{i_k} f(x_k)\, e_{i_k},$$
where $e_{i_k}$ is the $i_k$-th standard basis vector.

Theorem. If $\nabla f$ is $L$-Lipschitz continuous and $f$ is $\mu$-PL, then coordinate descent with step size $\eta = 1/L$ satisfies
$$\mathbb{E}[f(x_k) - f^*] \le \left(1 - \frac{\mu}{dL}\right)^k\,[f(x_0) - f^*].$$

Proof sketch. Coordinate-wise $L$-smoothness (implied by the Lipschitz gradient) gives $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\,\partial_{i_k} f(x_k)^2$. Taking the expectation over the uniformly random coordinate $i_k$ replaces $\partial_{i_k} f(x_k)^2$ by $\frac{1}{d}\|\nabla f(x_k)\|^2$, which the PL inequality bounds below by $\frac{2\mu}{d}\,(f(x_k) - f^*)$.
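
A minimal sketch of the algorithm on the same kind of least-squares objective (illustrative problem and constants, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(3, d))
b = rng.normal(size=3)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A)[-1]       # Lipschitz constant of the gradient

x = rng.normal(size=d)
for k in range(2000):
    i = rng.integers(d)                   # sample a coordinate uniformly
    x[i] -= grad(x)[i] / L                # gradient step along coordinate i only
print("objective after coordinate descent:", f(x))
```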

Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize $f(x)$, but we cannot sample its gradient directly. Instead, we sample a random gradient $\nabla f_i(x)$, where the $f_i$ are such that
$$\nabla f(x) = \mathbb{E}_i[\nabla f_i(x)].$$
For example, in typical machine learning, $x$ is the parameter vector of the neural network, $f_i(x)$ is the loss incurred on the $i$-th training data point, and $f(x)$ is the average loss over all training data points.

The gradient update step is
$$x_{k+1} = x_k - \eta_k \nabla f_{i_k}(x_k),$$
where $\eta_k > 0$ is a sequence of learning rates (the learning rate schedule).
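
A minimal sketch of this loop for an averaged least-squares loss (illustrative problem, schedule, and constants, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# f(x) = (1/n) sum_i f_i(x)  with  f_i(x) = 0.5 * (a_i . x - b_i)^2,
# so that  grad f(x) = E_i[grad f_i(x)]  for i uniform on {0, ..., n-1}
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
for k in range(10_000):
    eta_k = 0.5 / (k + 10)          # a decreasing learning rate schedule, eta_k = O(1/k)
    i = rng.integers(n)             # sample a data point uniformly
    x -= eta_k * grad_i(x, i)       # stochastic gradient step
```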

Proposition. If $\nabla f$ is $L$-Lipschitz continuous and $f$ is $\mu$-PL, then the update above satisfies
$$\mathbb{E}_{i_k}[f(x_{k+1})] - f^* \le (1 - 2\eta_k\mu)\,[f(x_k) - f^*] + \frac{L\eta_k^2}{2}\,\mathbb{E}_{i_k}\bigl[\|\nabla f_{i_k}(x_k)\|^2\bigr].$$

Proof sketch. $L$-smoothness gives
$$f(x_{k+1}) \le f(x_k) - \eta_k\,\nabla f(x_k)^T \nabla f_{i_k}(x_k) + \frac{L\eta_k^2}{2}\|\nabla f_{i_k}(x_k)\|^2.$$
Taking the expectation over $i_k$ turns the middle term into $-\eta_k\|\nabla f(x_k)\|^2$, which the PL inequality bounds above by $-2\eta_k\mu\,[f(x_k) - f^*]$.

As it stands, the proposition is difficult to use, because the second-moment term on the right depends on the iterates. Some further assumptions make it more tractable.

The second moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some $C > 0$ such that $\mathbb{E}_i[\|\nabla f_i(x_k)\|^2] \le C$ for all $k = 0, 1, \dots$ during the SGD process, then
$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - 2\eta_k\mu)\,[f(x_k) - f^*] + \frac{LC\eta_k^2}{2}.$$
Similarly, if $\mathbb{E}_i[\|\nabla f_i(x_k) - \nabla f(x_k)\|^2] \le C$ for all $k$, then
$$\mathbb{E}[f(x_{k+1}) - f^*] \le \bigl(1 - \mu(2\eta_k - L\eta_k^2)\bigr)\,[f(x_k) - f^*] + \frac{LC\eta_k^2}{2}.$$

Learning rate schedules

For the constant learning rate schedule $\eta_k = \eta = 1/L$, we have
$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - \mu/L)\,[f(x_k) - f^*] + \frac{C}{2L}.$$
By induction,
$$\mathbb{E}[f(x_k) - f^*] \le (1 - \mu/L)^k\,[f(x_0) - f^*] + \frac{C}{2\mu}.$$
The loss first decreases in expectation exponentially, but then stops decreasing at a floor of order $C/(2\mu)$, which is caused by the $C/(2L)$ term added at every step. In short, once the gradient descent steps are too large relative to the remaining optimality gap, the variance of the stochastic gradient starts to dominate, and $x_k$ performs a random walk in the vicinity of $X^*$.

For a decreasing learning rate schedule with $\eta_k = O(1/k)$, we have $\mathbb{E}[f(x_k) - f^*] = O(1/k)$.
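
The two behaviors can be seen in a toy simulation (illustrative problem and step sizes, chosen here for stability rather than exactly $1/L$; not from the source): the constant-step run plateaus at a noise floor, while the decreasing-step run keeps improving.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)       # average loss over data points
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]      # stochastic gradient for point i
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

def run(schedule, steps=20000):
    x = np.zeros(d)
    for k in range(steps):
        x -= schedule(k) * grad_i(x, rng.integers(n))
    return f(x) - f_star

print("constant step:   gap =", run(lambda k: 0.05))                  # noise floor
print("decreasing step: gap =", run(lambda k: 0.05 / (1 + k / 200)))  # keeps shrinking
```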
