Łojasiewicz inequality

☆ Save On Wikipedia ↗

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U  R be a real analytic function on an open set U in Rn, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K

dist ⁡ ( x , Z ) α ≤ C | f ( x ) | . {\displaystyle \operatorname {dist} (x,Z)^{\alpha }\leq C|f(x)|.} {\displaystyle \operatorname {dist} (x,Z)^{\alpha }\leq C|f(x)|.}

Here, α {\displaystyle \alpha } {\displaystyle \alpha } can be small.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p  U there is a possibly smaller open neighborhood W of p and constants θ  (0,1) and c > 0 such that

| f ( x ) − f ( p ) | θ ≤ c | ∇ f ( x ) | . {\displaystyle |f(x)-f(p)|^{\theta }\leq c|\nabla f(x)|.} {\displaystyle |f(x)-f(p)|^{\theta }\leq c|\nabla f(x)|.}

The proof for the one-dimensional case uses the order of vanishing: if f(x) − f(p) = (xp)k · g(x) with g(p) ≠ 0, then θ = (k − 1)/k works for k ≥ 2, and θ = 1/2 works for k = 1.

Polyak inequality

A special case of the Łojasiewicz inequality, due to Polyak (see condition C in [1]), is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).

Definitions

f {\textstyle f} {\textstyle f} is a function of type R d → R {\textstyle \mathbb {R} ^{d}\to \mathbb {R} } {\textstyle \mathbb {R} ^{d}\to \mathbb {R} }, and has a continuous derivative ∇ f {\displaystyle \nabla f} {\displaystyle \nabla f}.

X ∗ {\displaystyle X^{*}} {\displaystyle X^{*}} is the subset of R d {\displaystyle \mathbb {R} ^{d}} {\displaystyle \mathbb {R} ^{d}} on which f {\displaystyle f} {\displaystyle f} achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value f ∗ {\displaystyle f^{*}} {\displaystyle f^{*}} exists, unless otherwise stated. The optimization objective is to find some point x {\displaystyle x} {\displaystyle x} in X ∗ {\displaystyle X^{*}} {\displaystyle X^{*}}.

μ , L > 0 {\textstyle \mu ,L>0} {\textstyle \mu ,L>0} are constants.

∇ f {\textstyle \nabla f} {\textstyle \nabla f} is L {\displaystyle L} {\displaystyle L}-Lipschitz continuous if

‖ ∇ f ( x ) − ∇ f ( y ) ‖ ≤ L ‖ x − y ‖ , ∀ x , y {\displaystyle \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,\quad \forall x,y} {\displaystyle \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,\quad \forall x,y}

f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu }-strongly convex iff f ( y ) ≥ f ( x ) + ∇ f ( x ) T ( y − x ) + μ 2 ‖ y − x ‖ 2 ∀ x , y {\displaystyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}\quad \forall x,y} {\displaystyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}\quad \forall x,y}

f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu }-PL (where "PL" means "Polyak-Łojasiewicz") iff 1 2 ‖ ∇ f ( x ) ‖ 2 ≥ μ ( f ( x ) − f ( x ∗ ) ) , ∀ x {\displaystyle {\frac {1}{2}}\|\nabla f(x)\|^{2}\geq \mu \left(f(x)-f(x^{*})\right),\quad \forall x} {\displaystyle {\frac {1}{2}}\|\nabla f(x)\|^{2}\geq \mu \left(f(x)-f(x^{*})\right),\quad \forall x}

Basic properties

Theorem1. If f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu }-PL, then it is invex.

2. If ∇ f {\textstyle \nabla f} {\textstyle \nabla f} is L-Lipschitz continuous, then f ( y ) ≤ f ( x ) + ⟨ ∇ f ( x ) , y − x ⟩ + L 2 ‖ y − x ‖ 2 {\displaystyle f(y)\leq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}} {\displaystyle f(y)\leq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}}

3. If f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu }-strongly convex then it is μ {\textstyle \mu } {\textstyle \mu }-PL.

4. If g {\textstyle g} {\textstyle g} is μ {\textstyle \mu } {\textstyle \mu }-strongly convex, and A {\textstyle A} {\textstyle A} is linear, then f := g ∘ A {\textstyle f:=g\circ A} {\textstyle f:=g\circ A} is ( μ σ 2 ) {\textstyle (\mu \sigma ^{2})} {\textstyle (\mu \sigma ^{2})}-PL, where σ {\textstyle \sigma } {\textstyle \sigma } is the smallest nonzero singular value of A {\textstyle A} {\textstyle A}.

5. (quadratic growth) If f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu } -PL, x {\textstyle x} {\textstyle x} is a point, and x ∗ {\textstyle x^{*}} {\textstyle x^{*}} is the point on the optimum set that is closest to x {\textstyle x} {\textstyle x} in L2-norm, then f ( x ) ≥ f ( x ∗ ) + μ 2 ‖ x − x ∗ ‖ 2 2 {\displaystyle f(x)\geq f\left(x^{*}\right)+{\frac {\mu }{2}}\left\|x-x^{*}\right\|_{2}^{2}} {\displaystyle f(x)\geq f\left(x^{*}\right)+{\frac {\mu }{2}}\left\|x-x^{*}\right\|_{2}^{2}}

Proof
Proof

1. By definition, every stationary point is a global minimum.

2. Set g ( t ) = f ( x + t ( y − x ) ) {\textstyle g(t)=f(x+t(y-x))} {\textstyle g(t)=f(x+t(y-x))} for t ∈ [ 0 , 1 ] {\textstyle t\in [0,1]} {\textstyle t\in [0,1]} and use the L {\textstyle L} {\textstyle L} -Lipschitz continuity to show that f ( y ) − f ( x ) = g ( 1 ) − g ( 0 ) = ∫ 0 1 g ′ ( t ) = ⟨ ∫ 0 1 ∇ f ( x + t ( y − x ) ) d t , y − x ⟩ ≤ ⟨ ∇ f ( x ) , y − x ⟩ + L 2 ‖ y − x ‖ 2 {\textstyle f(y)-f(x)=g(1)-g(0)=\int _{0}^{1}g'(t)=\langle \int _{0}^{1}\nabla f(x+t(y-x))dt,y-x\rangle \leq \langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}} {\textstyle f(y)-f(x)=g(1)-g(0)=\int _{0}^{1}g'(t)=\langle \int _{0}^{1}\nabla f(x+t(y-x))dt,y-x\rangle \leq \langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}}.

3. By definition, f ( y ) ≥ f ( x ) + ∇ f ( x ) T ( y − x ) + μ 2 ‖ y − x ‖ 2 {\textstyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}} {\textstyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}}. Now, minimize the left side, we have f ( x ∗ ) ≥ f ( x ) + ∇ f ( x ) T ( x ∗ − x ) + μ 2 ‖ x ∗ − x ‖ 2 {\displaystyle f(x^{*})\geq f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}} {\displaystyle f(x^{*})\geq f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}} then minimize the right side, we have f ( x ) + ∇ f ( x ) T ( x ∗ − x ) + μ 2 ‖ x ∗ − x ‖ 2 ≥ f ( x ) − 1 2 μ ‖ ∇ f ( x ) ‖ 2 {\displaystyle f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}\geq f(x)-{\frac {1}{2\mu }}\|\nabla f(x)\|^{2}} {\displaystyle f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}\geq f(x)-{\frac {1}{2\mu }}\|\nabla f(x)\|^{2}} Combining the two, we have the μ {\textstyle \mu } {\textstyle \mu }-PL inequality.

f ( x k ) − f ( x ∗ ) ≤ ( 1 − μ / L ) k ( f ( x 0 ) − f ( x ∗ ) ) {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)} {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}

4. g ( A y ) ≥ g ( A x ) + ⟨ ∇ g ( A x ) , A y − A x ⟩ + μ 2 ‖ A y − A x ‖ 2 {\displaystyle g(Ay)\geq g(Ax)+\langle \nabla g(Ax),Ay-Ax\rangle +{\frac {\mu }{2}}\|Ay-Ax\|^{2}} {\displaystyle g(Ay)\geq g(Ax)+\langle \nabla g(Ax),Ay-Ax\rangle +{\frac {\mu }{2}}\|Ay-Ax\|^{2}}

Now, since ∇ f ( x ) = A T ∇ g ( A x ) {\textstyle \nabla f(x)=A^{T}\nabla g(Ax)} {\textstyle \nabla f(x)=A^{T}\nabla g(Ax)}, we have f ( y ) ≥ f ( x ) + ⟨ ∇ f ( x ) , y − x ⟩ + μ 2 ‖ A ( y − x ) ‖ 2 {\displaystyle f(y)\geq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {\mu }{2}}\|A(y-x)\|^{2}} {\displaystyle f(y)\geq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {\mu }{2}}\|A(y-x)\|^{2}}

Set y {\textstyle y} {\textstyle y} to be the projection of x {\textstyle x} {\textstyle x} to the optimum subspace, then we have ‖ A ( y − x ) ‖ ≥ σ ‖ y − x ‖ {\textstyle \|A(y-x)\|\geq \sigma \|y-x\|} {\textstyle \|A(y-x)\|\geq \sigma \|y-x\|}. Thus, we have f ( y ) − f ( x ) ≥ ⟨ ∇ f ( x ) , y − x ⟩ + μ σ 2 2 ‖ y − x ‖ 2 {\displaystyle f(y)-f(x)\geq \langle \nabla f(x),y-x\rangle +{\frac {\mu \sigma ^{2}}{2}}\|y-x\|^{2}} {\displaystyle f(y)-f(x)\geq \langle \nabla f(x),y-x\rangle +{\frac {\mu \sigma ^{2}}{2}}\|y-x\|^{2}} Vary the y {\textstyle y} {\textstyle y} on the right side to minimize the right side, we have the desired result.

5. Let g ( x ) := f ( x ) − f ∗ {\textstyle g(x):={\sqrt {f(x)-f^{*}}}} {\textstyle g(x):={\sqrt {f(x)-f^{*}}}}. For any x ∉ X ∗ {\textstyle x\not \in X^{*}} {\textstyle x\not \in X^{*}}, we have ∇ g ( x ) = ∇ f ( x ) 2 f ( x ) − f ∗ {\displaystyle \nabla g(x)={\frac {\nabla f(x)}{2{\sqrt {f(x)-f^{*}}}}}} {\displaystyle \nabla g(x)={\frac {\nabla f(x)}{2{\sqrt {f(x)-f^{*}}}}}} so by μ {\textstyle \mu } {\textstyle \mu } -PL,
‖ ∇ g ( x ) ‖ 2 ≥ μ / 2 {\displaystyle \|\nabla g(x)\|^{2}\geq \mu /2} {\displaystyle \|\nabla g(x)\|^{2}\geq \mu /2}

In particular, we see that ∇ g {\textstyle \nabla g} {\textstyle \nabla g} is a vector field on R d ∖ X ∗ {\textstyle \mathbb {R} ^{d}\setminus X^{*}} {\textstyle \mathbb {R} ^{d}\setminus X^{*}} with size at least μ / 2 {\textstyle {\sqrt {\mu /2}}} {\textstyle {\sqrt {\mu /2}}}. Define a gradient flow along ∇ g {\textstyle \nabla g} {\textstyle \nabla g} with constant unit velocity, starting at x ( 0 ) = x {\textstyle x(0)=x} {\textstyle x(0)=x}: x ( 0 ) = x , x ˙ ( t ) = ∇ g ‖ ∇ g ‖ {\displaystyle x(0)=x,\quad {\dot {x}}(t)={\frac {\nabla g}{\|\nabla g\|}}} {\displaystyle x(0)=x,\quad {\dot {x}}(t)={\frac {\nabla g}{\|\nabla g\|}}}

Because g {\textstyle g} {\textstyle g} is bounded below by 0 {\textstyle 0} {\textstyle 0}, and ‖ ∇ g ‖ ≥ μ / 2 {\textstyle \|\nabla g\|\geq {\sqrt {\mu /2}}} {\textstyle \|\nabla g\|\geq {\sqrt {\mu /2}}}, the gradient flow terminates on the zero set X ∗ {\textstyle X^{*}} {\textstyle X^{*}} at a finite time T ≤ g ( x ) / μ / 2 {\displaystyle T\leq g(x)/{\sqrt {\mu /2}}} {\displaystyle T\leq g(x)/{\sqrt {\mu /2}}} The path length is T {\textstyle T} {\textstyle T}, since the velocity is constantly 1.

Since x ( T ) {\textstyle x(T)} {\textstyle x(T)} is on the zero set, and x ∗ {\textstyle x^{*}} {\textstyle x^{*}} is the point closest to x {\textstyle x} {\textstyle x}, we have ‖ x ∗ − x ‖ ≤ T ≤ g ( x ) / μ / 2 {\displaystyle \|x^{*}-x\|\leq T\leq g(x)/{\sqrt {\mu /2}}} {\displaystyle \|x^{*}-x\|\leq T\leq g(x)/{\sqrt {\mu /2}}} which is the desired result.

Gradient descent

Theorem (linear convergence of gradient descent)If f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu }-PL and ∇ f {\textstyle \nabla f} {\textstyle \nabla f} is L {\textstyle L} {\textstyle L}-Lipschitz, then gradient descent with constant step size η {\textstyle \eta } {\textstyle \eta } x k + 1 = x k − η ∇ f ( x k ) {\displaystyle x_{k+1}=x_{k}-\eta \nabla f(x_{k})} {\displaystyle x_{k+1}=x_{k}-\eta \nabla f(x_{k})}converges linearly as f ( x k ) − f ( x ∗ ) ≤ ( 1 − 2 μ η ( 1 − L η / 2 ) ) k ( f ( x 0 ) − f ( x ∗ ) ) , η ∈ ( 0 , 2 / L ) {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right),\quad \eta \in (0,2/L)} {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right),\quad \eta \in (0,2/L)}

The convergence is the fastest when η = 1 / L {\textstyle \eta =1/L} {\textstyle \eta =1/L}, at which point f ( x k ) − f ( x ∗ ) ≤ ( 1 − μ / L ) k ( f ( x 0 ) − f ( x ∗ ) ) {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)} {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}

Proof
Proof

Since ∇ f {\textstyle \nabla f} {\textstyle \nabla f} is L {\textstyle L} {\textstyle L} -Lipschitz, we have the parabolic upper bound f ( x k + 1 ) ≤ f ( x k ) + ⟨ ∇ f ( x k ) , x k + 1 − x k ⟩ + L 2 ‖ x k + 1 − x k ‖ 2 {\displaystyle f(x_{k+1})\leq f(x_{k})+\langle \nabla f(x_{k}),x_{k+1}-x_{k}\rangle +{\frac {L}{2}}\|x_{k+1}-x_{k}\|^{2}} {\displaystyle f(x_{k+1})\leq f(x_{k})+\langle \nabla f(x_{k}),x_{k+1}-x_{k}\rangle +{\frac {L}{2}}\|x_{k+1}-x_{k}\|^{2}}

Plugging in the gradient descent step, f ( x k + 1 ) − f ( x k ) ≤ ⟨ ∇ f ( x k ) , − η ∇ f ( x k ) ⟩ + L 2 ‖ − η ∇ f ( x k ) ‖ 2 = ( L η 2 / 2 − η ) ‖ ∇ f ( x k ) ‖ 2 ≤ 2 μ ( L η 2 / 2 − η ) ( f ( x k ) − f ( x ∗ ) ) {\displaystyle {\begin{aligned}f(x_{k+1})-f(x_{k})&\leq \langle \nabla f(x_{k}),-\eta \nabla f(x_{k})\rangle +{\frac {L}{2}}\|-\eta \nabla f(x_{k})\|^{2}\\&=(L\eta ^{2}/2-\eta )\|\nabla f(x_{k})\|^{2}\\&\leq 2\mu (L\eta ^{2}/2-\eta )\left(f(x_{k})-f(x^{*})\right)\end{aligned}}} {\displaystyle {\begin{aligned}f(x_{k+1})-f(x_{k})&\leq \langle \nabla f(x_{k}),-\eta \nabla f(x_{k})\rangle +{\frac {L}{2}}\|-\eta \nabla f(x_{k})\|^{2}\\&=(L\eta ^{2}/2-\eta )\|\nabla f(x_{k})\|^{2}\\&\leq 2\mu (L\eta ^{2}/2-\eta )\left(f(x_{k})-f(x^{*})\right)\end{aligned}}}

Thus, f ( x k ) − f ( x ∗ ) ≤ ( 1 − 2 μ η ( 1 − L η / 2 ) ) k ( f ( x 0 ) − f ( x ∗ ) ) {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)} {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}

Corollary1. x k {\textstyle x_{k}} {\textstyle x_{k}} converges to the optimum set X ∗ {\textstyle X^{*}} {\textstyle X^{*}} at a rate of ( 1 − μ η ( 2 − L η ) ) {\textstyle \left(1-\mu \eta (2-L\eta )\right)} {\textstyle \left(1-\mu \eta (2-L\eta )\right)}.

2. If f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu } -PL, not constant, and ∇ f {\textstyle \nabla f} {\textstyle \nabla f} is L {\textstyle L} {\textstyle L} -Lipschitz, then L ≥ μ {\textstyle L\geq \mu } {\textstyle L\geq \mu }.

3. Under the same conditions, gradient descent with optimal step size (which might be found by line-searching) satisfies

f ( x k ) − f ( x ∗ ) ≤ ( 1 − μ / L ) k ( f ( x 0 ) − f ( x ∗ ) ) {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)} {\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}

Coordinate descent

The coordinate descent algorithm first samples a random coordinate i k {\textstyle i_{k}} {\textstyle i_{k}} uniformly, then perform gradient descent by x k + 1 = x k − η ∂ i k f ( x k ) e i k {\displaystyle x_{k+1}=x_{k}-\eta \partial _{i_{k}}f(x_{k})e_{i_{k}}} {\displaystyle x_{k+1}=x_{k}-\eta \partial _{i_{k}}f(x_{k})e_{i_{k}}}

TheoremAssume that f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu } -PL, and that ∇ f {\textstyle \nabla f} {\textstyle \nabla f} is L {\textstyle L} {\textstyle L} -Lipschitz at each coordinate, meaning that | ∂ i f ( x + t e i ) − ∂ i f ( x ) | ≤ L | t | {\displaystyle |\partial _{i}f(x+te_{i})-\partial _{i}f(x)|\leq L|t|} {\displaystyle |\partial _{i}f(x+te_{i})-\partial _{i}f(x)|\leq L|t|}Then, E [ f ( x k ) − f ( x ∗ ) ] {\textstyle \mathbb {E} [f(x_{k})-f(x^{*})]} {\textstyle \mathbb {E} [f(x_{k})-f(x^{*})]} converges linearly for all η ∈ ( 0 , 2 / L ) {\textstyle \eta \in (0,2/L)} {\textstyle \eta \in (0,2/L)} by E [ f ( x k ) − f ( x ∗ ) ] ≤ ( 1 − μ η ( 2 − L η ) d ) k ( f ( x 0 ) − f ( x ∗ ) ) {\displaystyle \mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)^{k}(f(x_{0})-f(x^{*}))} {\displaystyle \mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)^{k}(f(x_{0})-f(x^{*}))}

Proof
Proof

By the same argument, f ( x k + 1 ) ≤ f ( x k ) + ( L η 2 / 2 − η ) ( ∂ i k f ( x k ) ) 2 {\displaystyle f(x_{k+1})\leq f(x_{k})+(L\eta ^{2}/2-\eta )(\partial _{i_{k}}f(x_{k}))^{2}} {\displaystyle f(x_{k+1})\leq f(x_{k})+(L\eta ^{2}/2-\eta )(\partial _{i_{k}}f(x_{k}))^{2}}

Take expectation with respect to i k {\textstyle i_{k}} {\textstyle i_{k}}, we have E [ f ( x k + 1 ) ] ≤ f ( x k ) + L η 2 / 2 − η d ‖ ∇ f ( x k ) ‖ 2 {\displaystyle \mathbb {E} [f(x_{k+1})]\leq f(x_{k})+{\frac {L\eta ^{2}/2-\eta }{d}}\|\nabla f(x_{k})\|^{2}} {\displaystyle \mathbb {E} [f(x_{k+1})]\leq f(x_{k})+{\frac {L\eta ^{2}/2-\eta }{d}}\|\nabla f(x_{k})\|^{2}}

Plug in the μ {\textstyle \mu } {\textstyle \mu } -PL inequality, we have E [ f ( x k ) − f ( x ∗ ) ] ≤ ( 1 − μ η ( 2 − L η ) d ) ( f ( x k ) − f ( x ∗ ) ) {\displaystyle \mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)(f(x_{k})-f(x^{*}))} {\displaystyle \mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)(f(x_{k})-f(x^{*}))}Iterating the process, we have the desired result.

Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize f ( x ) {\textstyle f(x)} {\textstyle f(x)}, but we cannot sample its gradient directly. Instead, we sample a random gradient ∇ f i ( x ) {\textstyle \nabla f_{i}(x)} {\textstyle \nabla f_{i}(x)}, where f i {\textstyle f_{i}} {\textstyle f_{i}} are such that f ( x ) = E i [ f i ( x ) ] {\displaystyle f(x)=\mathbb {E} _{i}[f_{i}(x)]} {\displaystyle f(x)=\mathbb {E} _{i}[f_{i}(x)]} For example, in typical machine learning, x {\textstyle x} {\textstyle x} are the parameters of the neural network, and f i ( x ) {\textstyle f_{i}(x)} {\textstyle f_{i}(x)} is the loss incurred on the i {\textstyle i} {\textstyle i} -th training data point, while f ( x ) {\textstyle f(x)} {\textstyle f(x)} is the average loss over all training data points.

The gradient update step is x k + 1 = x k − η k ∇ f i k ( x k ) {\displaystyle x_{k+1}=x_{k}-\eta _{k}\nabla f_{i_{k}}(x_{k})} {\displaystyle x_{k+1}=x_{k}-\eta _{k}\nabla f_{i_{k}}(x_{k})} where η k > 0 {\textstyle \eta _{k}>0} {\textstyle \eta _{k}>0} are a sequence of learning rates (the learning rate schedule).

TheoremIf each ∇ f i {\textstyle \nabla f_{i}} {\textstyle \nabla f_{i}} is L {\textstyle L} {\textstyle L} -Lipschitz, f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu } -PL, and f {\textstyle f} {\textstyle f} has global minimum f ∗ {\textstyle f^{*}} {\textstyle f^{*}}, then E [ f ( x k + 1 ) − f ∗ ] ≤ ( 1 − 2 η k μ ) [ f ( x k ) − f ∗ ] + L η k 2 2 E i [ ‖ ∇ f i ( x k ) ‖ 2 ] {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]} {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]}We can also write it using the variance of gradient L2 norm: E [ f ( x k + 1 ) − f ∗ ] ≤ ( 1 − μ ( 2 η k − L η k 2 ) ) [ f ( x k ) − f ∗ ] + L η k 2 2 E i [ ‖ ∇ f i ( x k ) − ∇ f ( x k ) ‖ 2 ] {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]} {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]}

Proof
Proof

Because all ∇ f i {\textstyle \nabla f_{i}} {\textstyle \nabla f_{i}} are L {\textstyle L} {\textstyle L} -Lipschitz, so is ∇ f {\textstyle \nabla f} {\textstyle \nabla f}. We thus have f ( x k + 1 ) ≤ f ( x k ) − η k ⟨ ∇ f ( x k ) , ∇ f i k ( x k ) ⟩ + L η k 2 2 ‖ ∇ f i k ( x k ) ‖ 2 {\displaystyle f(x_{k+1})\leq f(x_{k})-\eta _{k}\langle \nabla f(x_{k}),\nabla f_{i_{k}}(x_{k})\rangle +{\frac {L\eta _{k}^{2}}{2}}\|\nabla f_{i_{k}}(x_{k})\|^{2}} {\displaystyle f(x_{k+1})\leq f(x_{k})-\eta _{k}\langle \nabla f(x_{k}),\nabla f_{i_{k}}(x_{k})\rangle +{\frac {L\eta _{k}^{2}}{2}}\|\nabla f_{i_{k}}(x_{k})\|^{2}}

Now, take the expectation over i k {\textstyle i_{k}} {\textstyle i_{k}}, and use the fact that f {\textstyle f} {\textstyle f} is μ {\textstyle \mu } {\textstyle \mu } -PL. This gives the first equation.

The second equation is shown similarly by noting that E i [ ‖ ∇ f i ( x k ) ‖ 2 ] = E i [ ‖ ∇ f i ( x k ) − ∇ f ( x k ) ‖ 2 ] + ‖ ∇ f ( x k ) ‖ 2 {\displaystyle \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]=\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]+\|\nabla f(x_{k})\|^{2}} {\displaystyle \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]=\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]+\|\nabla f(x_{k})\|^{2}}

As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions.

The second-moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some C > 0 {\textstyle C>0} {\textstyle C>0} such that during the SG process, we have E i [ ‖ ∇ f i ( x k ) ‖ 2 ] ≤ C {\displaystyle \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]\leq C} {\displaystyle \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]\leq C} for all k = 0 , 1 , … {\textstyle k=0,1,\dots } {\textstyle k=0,1,\dots }, then E [ f ( x k + 1 ) − f ∗ ] ≤ ( 1 − 2 η k μ ) [ f ( x k ) − f ∗ ] + L C η k 2 2 {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}} {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}}Similarly, if ∀ k , E i [ ‖ ∇ f i ( x k ) − ∇ f ( x k ) ‖ 2 ] ≤ C {\displaystyle \forall k,\quad \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]\leq C} {\displaystyle \forall k,\quad \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]\leq C} then E [ f ( x k + 1 ) − f ∗ ] ≤ ( 1 − μ ( 2 η k − L η k 2 ) ) [ f ( x k ) − f ∗ ] + L C η k 2 2 {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}} {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}}

Learning rate schedules

For constant learning rate schedule, with η k = η = 1 / L {\textstyle \eta _{k}=\eta =1/L} {\textstyle \eta _{k}=\eta =1/L}, we have E [ f ( x k + 1 ) − f ∗ ] ≤ ( 1 − μ / L ) [ f ( x k ) − f ∗ ] + C 2 L {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu /L\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {C}{2L}}} {\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu /L\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {C}{2L}}}By induction, we have E [ f ( x k ) − f ∗ ] ≤ ( 1 − μ / L ) k [ f ( x 0 ) − f ∗ ] + C 2 μ {\displaystyle \mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]\leq \left(1-\mu /L\right)^{k}\left[f\left(x_{0}\right)-f^{*}\right]+{\frac {C}{2\mu }}} {\displaystyle \mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]\leq \left(1-\mu /L\right)^{k}\left[f\left(x_{0}\right)-f^{*}\right]+{\frac {C}{2\mu }}}We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the C / ( 2 L ) {\textstyle C/(2L)} {\textstyle C/(2L)} term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and x k {\textstyle x_{k}} {\textstyle x_{k}} starts doing a random walk in the vicinity of X ∗ {\textstyle X^{*}} {\textstyle X^{*}}.

For decreasing learning rate schedule with η k = O ( 1 / k ) {\textstyle \eta _{k}=O(1/k)} {\textstyle \eta _{k}=O(1/k)}, we have E [ f ( x k ) − f ∗ ] = O ( 1 / k ) {\displaystyle \mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]=O(1/k)} {\displaystyle \mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]=O(1/k)}.

References

  1. Polyak, B. T. (1 January 1963). "Gradient methods for the minimisation of functionals". USSR Computational Mathematics and Mathematical Physics. 3 (4): 864–878. doi:10.1016/0041-5553(63)90382-3. Retrieved 27 December 2025.