Łojasiewicz inequality


In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let $f : U \to \mathbb{R}$ be a real analytic function on an open set $U$ in $\mathbb{R}^n$, and let $Z$ be the zero locus of $f$. Assume that $Z$ is not empty. Then for any compact set $K$ in $U$, there exist positive constants $\alpha$ and $C$ such that, for all $x$ in $K$,

$$\operatorname{dist}(x, Z)^{\alpha} \le C\,|f(x)|.$$

Here α can be large.
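
For instance, if $f(x) = x^m$ on $\mathbb{R}$ with $m$ a positive even integer, then $Z = \{0\}$ and $\operatorname{dist}(x, Z) = |x|$, so the inequality holds on any compact $K$ with $C = 1$ and $\alpha = m$, but with no exponent smaller than $m$; taking $m$ large shows that $\alpha$ cannot be bounded independently of $f$.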

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on $f$, for every $p \in U$ there is a possibly smaller open neighborhood $W$ of $p$ and constants $\theta \in (0,1)$ and $c > 0$ such that, for all $x \in W$,

$$|f(x) - f(p)|^{\theta} \le c\,\|\nabla f(x)\|.$$
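
For instance, if $f(x) = x^2$ and $p = 0$, then $|f(x) - f(p)|^{1/2} = |x|$ and $|\nabla f(x)| = 2|x|$, so the inequality holds on all of $\mathbb{R}$ with $\theta = 1/2$ and $c = 1/2$.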

Polyak inequality

A special case of the Łojasiewicz inequality, due to Boris Polyak, is commonly used to prove linear convergence of gradient descent algorithms.

Definitions

$f : \mathbb{R}^d \to \mathbb{R}$ is a differentiable function with a continuous gradient $\nabla f$.

$X^*$ is the subset of $\mathbb{R}^d$ on which $f$ achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value $f^*$ exists, unless otherwise stated. The optimization objective is to find some point $x^*$ in $X^*$.

$\mu, L > 0$ are constants.

$\nabla f$ is $L$-Lipschitz continuous iff
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \forall x, y.$$

$f$ is $\mu$-strongly convex iff
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\|y - x\|^2 \quad \forall x, y.$$

$f$ is $\mu$-PL (where "PL" means "Polyak–Łojasiewicz") iff
$$\frac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f(x^*)\bigr) \quad \forall x.$$
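
For a concrete check of these definitions, consider the least-squares objective $f(x) = \tfrac{1}{2}\|Ax - b\|^2$: its gradient is $L$-Lipschitz with $L = \lambda_{\max}(A^T A)$, and it is $\mu$-PL with $\mu$ the smallest nonzero eigenvalue of $A^T A$, even when $A^T A$ is singular and $f$ is therefore not strongly convex. The following sketch (an illustrative example, not from the source; NumPy assumed) verifies the PL inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 5                        # n < d: A^T A is singular, so f is not strongly convex
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L = eigs[-1]                                  # gradient is L-Lipschitz with this L
mu = min(e for e in eigs if e > 1e-9)         # smallest nonzero eigenvalue: a PL constant
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])   # global minimum value

# check  (1/2) ||grad f(x)||^2  >=  mu (f(x) - f*)  at random points
for _ in range(1000):
    x = 10 * rng.normal(size=d)
    assert 0.5 * grad(x) @ grad(x) >= mu * (f(x) - f_star) - 1e-8
print("mu-PL inequality verified at 1000 sampled points")
```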

Basic properties

If $f$ is $\mu$-strongly convex, then it is $\mu$-PL. Indeed, minimizing both sides of the strong-convexity inequality over $y$ gives
$$f^* \ge f(x) - \frac{1}{2\mu}\|\nabla f(x)\|^2,$$
which rearranges to the PL inequality. The converse fails: a $\mu$-PL function need not be convex. However, every stationary point of a $\mu$-PL function is a global minimizer, since $\nabla f(x) = 0$ forces $f(x) \le f^*$.

Gradient descent

Theorem (linear convergence of gradient descent). Suppose $\nabla f$ is $L$-Lipschitz continuous and $f$ is $\mu$-PL, and run gradient descent with step size $1/L$:
$$x_{k+1} = x_k - \frac{1}{L}\nabla f(x_k).$$
Then
$$f(x_k) - f^* \le (1 - \mu/L)^k\,[f(x_0) - f^*],$$
so the optimality gap shrinks by a constant factor at every step.

Proof sketch. $L$-smoothness gives the descent inequality $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2$; bounding $\|\nabla f(x_k)\|^2 \ge 2\mu\,(f(x_k) - f^*)$ by the PL inequality and subtracting $f^*$ from both sides yields $f(x_{k+1}) - f^* \le (1 - \mu/L)\,[f(x_k) - f^*]$, and induction finishes the proof.
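
The rate can be checked numerically. The sketch below (an illustrative test problem, not from the source) runs gradient descent on a rank-deficient least-squares objective, which is PL but not strongly convex, and asserts the bound at every iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))    # rank-deficient least squares: PL but not strongly convex
b = rng.normal(size=3)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

eigs = np.linalg.eigvalsh(A.T @ A)
L = eigs[-1]                                  # smoothness constant
mu = min(e for e in eigs if e > 1e-9)         # PL constant (smallest nonzero eigenvalue)
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

x = rng.normal(size=5)
gap0 = f(x) - f_star
for k in range(1, 51):
    x = x - grad(x) / L                       # gradient descent with step size 1/L
    # theorem:  f(x_k) - f*  <=  (1 - mu/L)^k (f(x_0) - f*)
    assert f(x) - f_star <= (1 - mu / L) ** k * gap0 + 1e-9
print("linear convergence bound holds for 50 iterations")
```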

Coordinate descent

The coordinate descent algorithm first samples a coordinate $i_k$ uniformly at random, then performs a gradient descent step along that coordinate:
$$x_{k+1} = x_k - \eta\,\partial_{i_k} f(x_k)\, e_{i_k},$$
where $e_{i_k}$ is the $i_k$-th standard basis vector.

Theorem. If $\nabla f$ is $L$-Lipschitz continuous and $f$ is $\mu$-PL, then coordinate descent with step size $\eta = 1/L$ satisfies
$$\mathbb{E}[f(x_k) - f^*] \le \left(1 - \frac{\mu}{dL}\right)^k\,[f(x_0) - f^*].$$

Proof sketch. Coordinate-wise $L$-smoothness (implied by the Lipschitz gradient) gives $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\,\partial_{i_k} f(x_k)^2$. Taking the expectation over the uniformly random coordinate $i_k$ replaces $\partial_{i_k} f(x_k)^2$ by $\frac{1}{d}\|\nabla f(x_k)\|^2$, which the PL inequality bounds below by $\frac{2\mu}{d}\,(f(x_k) - f^*)$.
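
A minimal sketch of the algorithm on the same kind of least-squares objective (illustrative problem and constants, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(3, d))
b = rng.normal(size=3)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A)[-1]       # Lipschitz constant of the gradient

x = rng.normal(size=d)
for k in range(2000):
    i = rng.integers(d)                   # sample a coordinate uniformly
    x[i] -= grad(x)[i] / L                # gradient step along coordinate i only
print("objective after coordinate descent:", f(x))
```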

Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize $f(x)$, but we cannot sample its gradient directly. Instead, we sample a random gradient $\nabla f_i(x)$, where the $f_i$ are such that
$$\nabla f(x) = \mathbb{E}_i[\nabla f_i(x)].$$
For example, in typical machine learning, $x$ is the parameter vector of the neural network, $f_i(x)$ is the loss incurred on the $i$-th training data point, and $f(x)$ is the average loss over all training data points.

The gradient update step is
$$x_{k+1} = x_k - \eta_k \nabla f_{i_k}(x_k),$$
where $\eta_k > 0$ is a sequence of learning rates (the learning rate schedule).
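
A minimal sketch of this loop for an averaged least-squares loss (illustrative problem, schedule, and constants, not from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

# f(x) = (1/n) sum_i f_i(x)  with  f_i(x) = 0.5 * (a_i . x - b_i)^2,
# so that  grad f(x) = E_i[grad f_i(x)]  for i uniform on {0, ..., n-1}
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]

x = np.zeros(d)
for k in range(10_000):
    eta_k = 0.5 / (k + 10)          # a decreasing learning rate schedule, eta_k = O(1/k)
    i = rng.integers(n)             # sample a data point uniformly
    x -= eta_k * grad_i(x, i)       # stochastic gradient step
```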

Proposition. If $\nabla f$ is $L$-Lipschitz continuous and $f$ is $\mu$-PL, then the update above satisfies
$$\mathbb{E}_{i_k}[f(x_{k+1})] - f^* \le (1 - 2\eta_k\mu)\,[f(x_k) - f^*] + \frac{L\eta_k^2}{2}\,\mathbb{E}_{i_k}\bigl[\|\nabla f_{i_k}(x_k)\|^2\bigr].$$

Proof sketch. $L$-smoothness gives
$$f(x_{k+1}) \le f(x_k) - \eta_k\,\nabla f(x_k)^T \nabla f_{i_k}(x_k) + \frac{L\eta_k^2}{2}\|\nabla f_{i_k}(x_k)\|^2.$$
Taking the expectation over $i_k$ turns the middle term into $-\eta_k\|\nabla f(x_k)\|^2$, which the PL inequality bounds above by $-2\eta_k\mu\,[f(x_k) - f^*]$.

As it stands, the proposition is difficult to use, because the second-moment term on the right depends on the iterates. Some further assumptions make it more tractable.

The second moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some $C > 0$ such that $\mathbb{E}_i[\|\nabla f_i(x_k)\|^2] \le C$ for all $k = 0, 1, \dots$ during the SGD process, then
$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - 2\eta_k\mu)\,[f(x_k) - f^*] + \frac{LC\eta_k^2}{2}.$$
Similarly, if $\mathbb{E}_i[\|\nabla f_i(x_k) - \nabla f(x_k)\|^2] \le C$ for all $k$, then
$$\mathbb{E}[f(x_{k+1}) - f^*] \le \bigl(1 - \mu(2\eta_k - L\eta_k^2)\bigr)\,[f(x_k) - f^*] + \frac{LC\eta_k^2}{2}.$$

Learning rate schedules

For the constant learning rate schedule $\eta_k = \eta = 1/L$, we have
$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - \mu/L)\,[f(x_k) - f^*] + \frac{C}{2L}.$$
By induction,
$$\mathbb{E}[f(x_k) - f^*] \le (1 - \mu/L)^k\,[f(x_0) - f^*] + \frac{C}{2\mu}.$$
The loss first decreases in expectation exponentially, but then stops decreasing at a floor of order $C/(2\mu)$, which is caused by the $C/(2L)$ term added at every step. In short, once the gradient descent steps are too large relative to the remaining optimality gap, the variance of the stochastic gradient starts to dominate, and $x_k$ performs a random walk in the vicinity of $X^*$.

For a decreasing learning rate schedule with $\eta_k = O(1/k)$, we have $\mathbb{E}[f(x_k) - f^*] = O(1/k)$.
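
The two behaviors can be seen in a toy simulation (illustrative problem and step sizes, chosen here for stability rather than exactly $1/L$; not from the source): the constant-step run plateaus at a noise floor, while the decreasing-step run keeps improving.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)
f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)       # average loss over data points
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]      # stochastic gradient for point i
f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

def run(schedule, steps=20000):
    x = np.zeros(d)
    for k in range(steps):
        x -= schedule(k) * grad_i(x, rng.integers(n))
    return f(x) - f_star

print("constant step:   gap =", run(lambda k: 0.05))                  # noise floor
print("decreasing step: gap =", run(lambda k: 0.05 / (1 + k / 200)))  # keeps shrinking
```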
