In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in Rn, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K
-
dist
(
x
,
Z
)
α
≤
C
|
f
(
x
)
|
.
{\displaystyle \operatorname {dist} (x,Z)^{\alpha }\leq C|f(x)|.}
Here,
α
{\displaystyle \alpha }
can be small.
The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that
-
|
f
(
x
)
−
f
(
p
)
|
θ
≤
c
|
∇
f
(
x
)
|
.
{\displaystyle |f(x)-f(p)|^{\theta }\leq c|\nabla f(x)|.}
The proof for the one-dimensional case uses the order of vanishing: if f(x) − f(p) = (x − p)k · g(x) with g(p) ≠ 0, then θ = (k − 1)/k works for k ≥ 2, and θ = 1/2 works for k = 1.
Polyak inequality
A special case of the Łojasiewicz inequality, due to Polyak (see condition C in [1]), is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).
Definitions
f
{\textstyle f}
is a function of type
R
d
→
R
{\textstyle \mathbb {R} ^{d}\to \mathbb {R} }
, and has a continuous derivative
∇
f
{\displaystyle \nabla f}
.
X
∗
{\displaystyle X^{*}}
is the subset of
R
d
{\displaystyle \mathbb {R} ^{d}}
on which
f
{\displaystyle f}
achieves its global minimum (if one exists). Throughout this section we assume such a global minimum value
f
∗
{\displaystyle f^{*}}
exists, unless otherwise stated. The optimization objective is to find some point
x
{\displaystyle x}
in
X
∗
{\displaystyle X^{*}}
.
μ
,
L
>
0
{\textstyle \mu ,L>0}
are constants.
∇
f
{\textstyle \nabla f}
is
L
{\displaystyle L}
-Lipschitz continuous if
‖
∇
f
(
x
)
−
∇
f
(
y
)
‖
≤
L
‖
x
−
y
‖
,
∀
x
,
y
{\displaystyle \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,\quad \forall x,y}
f
{\textstyle f}
is
μ
{\textstyle \mu }
-strongly convex iff
f
(
y
)
≥
f
(
x
)
+
∇
f
(
x
)
T
(
y
−
x
)
+
μ
2
‖
y
−
x
‖
2
∀
x
,
y
{\displaystyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}\quad \forall x,y}
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL (where "PL" means "Polyak-Łojasiewicz") iff
1
2
‖
∇
f
(
x
)
‖
2
≥
μ
(
f
(
x
)
−
f
(
x
∗
)
)
,
∀
x
{\displaystyle {\frac {1}{2}}\|\nabla f(x)\|^{2}\geq \mu \left(f(x)-f(x^{*})\right),\quad \forall x}
Basic properties
Theorem—1. If
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL, then it is invex.
2. If
∇
f
{\textstyle \nabla f}
is L-Lipschitz continuous, then
f
(
y
)
≤
f
(
x
)
+
⟨
∇
f
(
x
)
,
y
−
x
⟩
+
L
2
‖
y
−
x
‖
2
{\displaystyle f(y)\leq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}}
3. If
f
{\textstyle f}
is
μ
{\textstyle \mu }
-strongly convex then it is
μ
{\textstyle \mu }
-PL.
4. If
g
{\textstyle g}
is
μ
{\textstyle \mu }
-strongly convex, and
A
{\textstyle A}
is linear, then
f
:=
g
∘
A
{\textstyle f:=g\circ A}
is
(
μ
σ
2
)
{\textstyle (\mu \sigma ^{2})}
-PL, where
σ
{\textstyle \sigma }
is the smallest nonzero singular value of
A
{\textstyle A}
.
5. (quadratic growth) If
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL,
x
{\textstyle x}
is a point, and
x
∗
{\textstyle x^{*}}
is the point on the optimum set that is closest to
x
{\textstyle x}
in L2-norm, then
f
(
x
)
≥
f
(
x
∗
)
+
μ
2
‖
x
−
x
∗
‖
2
2
{\displaystyle f(x)\geq f\left(x^{*}\right)+{\frac {\mu }{2}}\left\|x-x^{*}\right\|_{2}^{2}}
1. By definition, every stationary point is a global minimum.
2. Set
g
(
t
)
=
f
(
x
+
t
(
y
−
x
)
)
{\textstyle g(t)=f(x+t(y-x))}
for
t
∈
[
0
,
1
]
{\textstyle t\in [0,1]}
and use the
L
{\textstyle L}
-Lipschitz continuity to show that
f
(
y
)
−
f
(
x
)
=
g
(
1
)
−
g
(
0
)
=
∫
0
1
g
′
(
t
)
=
⟨
∫
0
1
∇
f
(
x
+
t
(
y
−
x
)
)
d
t
,
y
−
x
⟩
≤
⟨
∇
f
(
x
)
,
y
−
x
⟩
+
L
2
‖
y
−
x
‖
2
{\textstyle f(y)-f(x)=g(1)-g(0)=\int _{0}^{1}g'(t)=\langle \int _{0}^{1}\nabla f(x+t(y-x))dt,y-x\rangle \leq \langle \nabla f(x),y-x\rangle +{\frac {L}{2}}\|y-x\|^{2}}
.
3. By definition,
f
(
y
)
≥
f
(
x
)
+
∇
f
(
x
)
T
(
y
−
x
)
+
μ
2
‖
y
−
x
‖
2
{\textstyle f(y)\geq f(x)+\nabla f(x)^{T}(y-x)+{\frac {\mu }{2}}\lVert y-x\rVert ^{2}}
. Now, minimize the left side, we have
f
(
x
∗
)
≥
f
(
x
)
+
∇
f
(
x
)
T
(
x
∗
−
x
)
+
μ
2
‖
x
∗
−
x
‖
2
{\displaystyle f(x^{*})\geq f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}}
then minimize the right side, we have
f
(
x
)
+
∇
f
(
x
)
T
(
x
∗
−
x
)
+
μ
2
‖
x
∗
−
x
‖
2
≥
f
(
x
)
−
1
2
μ
‖
∇
f
(
x
)
‖
2
{\displaystyle f(x)+\nabla f(x)^{T}(x^{*}-x)+{\frac {\mu }{2}}\lVert x^{*}-x\rVert ^{2}\geq f(x)-{\frac {1}{2\mu }}\|\nabla f(x)\|^{2}}
Combining the two, we have the
μ
{\textstyle \mu }
-PL inequality.
f
(
x
k
)
−
f
(
x
∗
)
≤
(
1
−
μ
/
L
)
k
(
f
(
x
0
)
−
f
(
x
∗
)
)
{\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}
4.
g
(
A
y
)
≥
g
(
A
x
)
+
⟨
∇
g
(
A
x
)
,
A
y
−
A
x
⟩
+
μ
2
‖
A
y
−
A
x
‖
2
{\displaystyle g(Ay)\geq g(Ax)+\langle \nabla g(Ax),Ay-Ax\rangle +{\frac {\mu }{2}}\|Ay-Ax\|^{2}}
Now, since
∇
f
(
x
)
=
A
T
∇
g
(
A
x
)
{\textstyle \nabla f(x)=A^{T}\nabla g(Ax)}
, we have
f
(
y
)
≥
f
(
x
)
+
⟨
∇
f
(
x
)
,
y
−
x
⟩
+
μ
2
‖
A
(
y
−
x
)
‖
2
{\displaystyle f(y)\geq f(x)+\langle \nabla f(x),y-x\rangle +{\frac {\mu }{2}}\|A(y-x)\|^{2}}
Set
y
{\textstyle y}
to be the projection of
x
{\textstyle x}
to the optimum subspace, then we have
‖
A
(
y
−
x
)
‖
≥
σ
‖
y
−
x
‖
{\textstyle \|A(y-x)\|\geq \sigma \|y-x\|}
. Thus, we have
f
(
y
)
−
f
(
x
)
≥
⟨
∇
f
(
x
)
,
y
−
x
⟩
+
μ
σ
2
2
‖
y
−
x
‖
2
{\displaystyle f(y)-f(x)\geq \langle \nabla f(x),y-x\rangle +{\frac {\mu \sigma ^{2}}{2}}\|y-x\|^{2}}
Vary the
y
{\textstyle y}
on the right side to minimize the right side, we have the desired result.
5. Let
g
(
x
)
:=
f
(
x
)
−
f
∗
{\textstyle g(x):={\sqrt {f(x)-f^{*}}}}
. For any
x
∉
X
∗
{\textstyle x\not \in X^{*}}
, we have
∇
g
(
x
)
=
∇
f
(
x
)
2
f
(
x
)
−
f
∗
{\displaystyle \nabla g(x)={\frac {\nabla f(x)}{2{\sqrt {f(x)-f^{*}}}}}}
so by
μ
{\textstyle \mu }
-PL,
‖
∇
g
(
x
)
‖
2
≥
μ
/
2
{\displaystyle \|\nabla g(x)\|^{2}\geq \mu /2}
In particular, we see that
∇
g
{\textstyle \nabla g}
is a vector field on
R
d
∖
X
∗
{\textstyle \mathbb {R} ^{d}\setminus X^{*}}
with size at least
μ
/
2
{\textstyle {\sqrt {\mu /2}}}
. Define a gradient flow along
∇
g
{\textstyle \nabla g}
with constant unit velocity, starting at
x
(
0
)
=
x
{\textstyle x(0)=x}
:
x
(
0
)
=
x
,
x
˙
(
t
)
=
∇
g
‖
∇
g
‖
{\displaystyle x(0)=x,\quad {\dot {x}}(t)={\frac {\nabla g}{\|\nabla g\|}}}
Because
g
{\textstyle g}
is bounded below by
0
{\textstyle 0}
, and
‖
∇
g
‖
≥
μ
/
2
{\textstyle \|\nabla g\|\geq {\sqrt {\mu /2}}}
, the gradient flow terminates on the zero set
X
∗
{\textstyle X^{*}}
at a finite time
T
≤
g
(
x
)
/
μ
/
2
{\displaystyle T\leq g(x)/{\sqrt {\mu /2}}}
The path length is
T
{\textstyle T}
, since the velocity is constantly 1.
Since
x
(
T
)
{\textstyle x(T)}
is on the zero set, and
x
∗
{\textstyle x^{*}}
is the point closest to
x
{\textstyle x}
, we have
‖
x
∗
−
x
‖
≤
T
≤
g
(
x
)
/
μ
/
2
{\displaystyle \|x^{*}-x\|\leq T\leq g(x)/{\sqrt {\mu /2}}}
which is the desired result.
Gradient descent
Theorem (linear convergence of gradient descent)—If
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL and
∇
f
{\textstyle \nabla f}
is
L
{\textstyle L}
-Lipschitz, then gradient descent with constant step size
η
{\textstyle \eta }
x
k
+
1
=
x
k
−
η
∇
f
(
x
k
)
{\displaystyle x_{k+1}=x_{k}-\eta \nabla f(x_{k})}
converges linearly as
f
(
x
k
)
−
f
(
x
∗
)
≤
(
1
−
2
μ
η
(
1
−
L
η
/
2
)
)
k
(
f
(
x
0
)
−
f
(
x
∗
)
)
,
η
∈
(
0
,
2
/
L
)
{\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right),\quad \eta \in (0,2/L)}
The convergence is the fastest when
η
=
1
/
L
{\textstyle \eta =1/L}
, at which point
f
(
x
k
)
−
f
(
x
∗
)
≤
(
1
−
μ
/
L
)
k
(
f
(
x
0
)
−
f
(
x
∗
)
)
{\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}
Since
∇
f
{\textstyle \nabla f}
is
L
{\textstyle L}
-Lipschitz, we have the parabolic upper bound
f
(
x
k
+
1
)
≤
f
(
x
k
)
+
⟨
∇
f
(
x
k
)
,
x
k
+
1
−
x
k
⟩
+
L
2
‖
x
k
+
1
−
x
k
‖
2
{\displaystyle f(x_{k+1})\leq f(x_{k})+\langle \nabla f(x_{k}),x_{k+1}-x_{k}\rangle +{\frac {L}{2}}\|x_{k+1}-x_{k}\|^{2}}
Plugging in the gradient descent step,
f
(
x
k
+
1
)
−
f
(
x
k
)
≤
⟨
∇
f
(
x
k
)
,
−
η
∇
f
(
x
k
)
⟩
+
L
2
‖
−
η
∇
f
(
x
k
)
‖
2
=
(
L
η
2
/
2
−
η
)
‖
∇
f
(
x
k
)
‖
2
≤
2
μ
(
L
η
2
/
2
−
η
)
(
f
(
x
k
)
−
f
(
x
∗
)
)
{\displaystyle {\begin{aligned}f(x_{k+1})-f(x_{k})&\leq \langle \nabla f(x_{k}),-\eta \nabla f(x_{k})\rangle +{\frac {L}{2}}\|-\eta \nabla f(x_{k})\|^{2}\\&=(L\eta ^{2}/2-\eta )\|\nabla f(x_{k})\|^{2}\\&\leq 2\mu (L\eta ^{2}/2-\eta )\left(f(x_{k})-f(x^{*})\right)\end{aligned}}}
Thus,
f
(
x
k
)
−
f
(
x
∗
)
≤
(
1
−
2
μ
η
(
1
−
L
η
/
2
)
)
k
(
f
(
x
0
)
−
f
(
x
∗
)
)
{\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-2\mu \eta (1-L\eta /2)\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}
Corollary—1.
x
k
{\textstyle x_{k}}
converges to the optimum set
X
∗
{\textstyle X^{*}}
at a rate of
(
1
−
μ
η
(
2
−
L
η
)
)
{\textstyle \left(1-\mu \eta (2-L\eta )\right)}
.
2. If
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL, not constant, and
∇
f
{\textstyle \nabla f}
is
L
{\textstyle L}
-Lipschitz, then
L
≥
μ
{\textstyle L\geq \mu }
.
3. Under the same conditions, gradient descent with optimal step size (which might be found by line-searching) satisfies
f
(
x
k
)
−
f
(
x
∗
)
≤
(
1
−
μ
/
L
)
k
(
f
(
x
0
)
−
f
(
x
∗
)
)
{\displaystyle f\left(x_{k}\right)-f\left(x^{*}\right)\leq \left(1-\mu /L\right)^{k}\left(f\left(x_{0}\right)-f\left(x^{*}\right)\right)}
Coordinate descent
The coordinate descent algorithm first samples a random coordinate
i
k
{\textstyle i_{k}}
uniformly, then perform gradient descent by
x
k
+
1
=
x
k
−
η
∂
i
k
f
(
x
k
)
e
i
k
{\displaystyle x_{k+1}=x_{k}-\eta \partial _{i_{k}}f(x_{k})e_{i_{k}}}
Theorem—Assume that
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL, and that
∇
f
{\textstyle \nabla f}
is
L
{\textstyle L}
-Lipschitz at each coordinate, meaning that
|
∂
i
f
(
x
+
t
e
i
)
−
∂
i
f
(
x
)
|
≤
L
|
t
|
{\displaystyle |\partial _{i}f(x+te_{i})-\partial _{i}f(x)|\leq L|t|}
Then,
E
[
f
(
x
k
)
−
f
(
x
∗
)
]
{\textstyle \mathbb {E} [f(x_{k})-f(x^{*})]}
converges linearly for all
η
∈
(
0
,
2
/
L
)
{\textstyle \eta \in (0,2/L)}
by
E
[
f
(
x
k
)
−
f
(
x
∗
)
]
≤
(
1
−
μ
η
(
2
−
L
η
)
d
)
k
(
f
(
x
0
)
−
f
(
x
∗
)
)
{\displaystyle \mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)^{k}(f(x_{0})-f(x^{*}))}
By the same argument,
f
(
x
k
+
1
)
≤
f
(
x
k
)
+
(
L
η
2
/
2
−
η
)
(
∂
i
k
f
(
x
k
)
)
2
{\displaystyle f(x_{k+1})\leq f(x_{k})+(L\eta ^{2}/2-\eta )(\partial _{i_{k}}f(x_{k}))^{2}}
Take expectation with respect to
i
k
{\textstyle i_{k}}
, we have
E
[
f
(
x
k
+
1
)
]
≤
f
(
x
k
)
+
L
η
2
/
2
−
η
d
‖
∇
f
(
x
k
)
‖
2
{\displaystyle \mathbb {E} [f(x_{k+1})]\leq f(x_{k})+{\frac {L\eta ^{2}/2-\eta }{d}}\|\nabla f(x_{k})\|^{2}}
Plug in the
μ
{\textstyle \mu }
-PL inequality, we have
E
[
f
(
x
k
)
−
f
(
x
∗
)
]
≤
(
1
−
μ
η
(
2
−
L
η
)
d
)
(
f
(
x
k
)
−
f
(
x
∗
)
)
{\displaystyle \mathbb {E} [f(x_{k})-f(x^{*})]\leq \left(1-{\frac {\mu \eta (2-L\eta )}{d}}\right)(f(x_{k})-f(x^{*}))}
Iterating the process, we have the desired result.
Stochastic gradient descent
In stochastic gradient descent, we have a function to minimize
f
(
x
)
{\textstyle f(x)}
, but we cannot sample its gradient directly. Instead, we sample a random gradient
∇
f
i
(
x
)
{\textstyle \nabla f_{i}(x)}
, where
f
i
{\textstyle f_{i}}
are such that
f
(
x
)
=
E
i
[
f
i
(
x
)
]
{\displaystyle f(x)=\mathbb {E} _{i}[f_{i}(x)]}
For example, in typical machine learning,
x
{\textstyle x}
are the parameters of the neural network, and
f
i
(
x
)
{\textstyle f_{i}(x)}
is the loss incurred on the
i
{\textstyle i}
-th training data point, while
f
(
x
)
{\textstyle f(x)}
is the average loss over all training data points.
The gradient update step is
x
k
+
1
=
x
k
−
η
k
∇
f
i
k
(
x
k
)
{\displaystyle x_{k+1}=x_{k}-\eta _{k}\nabla f_{i_{k}}(x_{k})}
where
η
k
>
0
{\textstyle \eta _{k}>0}
are a sequence of learning rates (the learning rate schedule).
Theorem—If each
∇
f
i
{\textstyle \nabla f_{i}}
is
L
{\textstyle L}
-Lipschitz,
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL, and
f
{\textstyle f}
has global minimum
f
∗
{\textstyle f^{*}}
, then
E
[
f
(
x
k
+
1
)
−
f
∗
]
≤
(
1
−
2
η
k
μ
)
[
f
(
x
k
)
−
f
∗
]
+
L
η
k
2
2
E
i
[
‖
∇
f
i
(
x
k
)
‖
2
]
{\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]}
We can also write it using the variance of gradient L2 norm:
E
[
f
(
x
k
+
1
)
−
f
∗
]
≤
(
1
−
μ
(
2
η
k
−
L
η
k
2
)
)
[
f
(
x
k
)
−
f
∗
]
+
L
η
k
2
2
E
i
[
‖
∇
f
i
(
x
k
)
−
∇
f
(
x
k
)
‖
2
]
{\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {L\eta _{k}^{2}}{2}}\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]}
Because all
∇
f
i
{\textstyle \nabla f_{i}}
are
L
{\textstyle L}
-Lipschitz, so is
∇
f
{\textstyle \nabla f}
. We thus have
f
(
x
k
+
1
)
≤
f
(
x
k
)
−
η
k
⟨
∇
f
(
x
k
)
,
∇
f
i
k
(
x
k
)
⟩
+
L
η
k
2
2
‖
∇
f
i
k
(
x
k
)
‖
2
{\displaystyle f(x_{k+1})\leq f(x_{k})-\eta _{k}\langle \nabla f(x_{k}),\nabla f_{i_{k}}(x_{k})\rangle +{\frac {L\eta _{k}^{2}}{2}}\|\nabla f_{i_{k}}(x_{k})\|^{2}}
Now, take the expectation over
i
k
{\textstyle i_{k}}
, and use the fact that
f
{\textstyle f}
is
μ
{\textstyle \mu }
-PL. This gives the first equation.
The second equation is shown similarly by noting that
E
i
[
‖
∇
f
i
(
x
k
)
‖
2
]
=
E
i
[
‖
∇
f
i
(
x
k
)
−
∇
f
(
x
k
)
‖
2
]
+
‖
∇
f
(
x
k
)
‖
2
{\displaystyle \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]=\mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]+\|\nabla f(x_{k})\|^{2}}
As it is, the proposition is difficult to use. We can make it easier to use by some further assumptions.
The second-moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some
C
>
0
{\textstyle C>0}
such that during the SG process, we have
E
i
[
‖
∇
f
i
(
x
k
)
‖
2
]
≤
C
{\displaystyle \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})\|^{2}]\leq C}
for all
k
=
0
,
1
,
…
{\textstyle k=0,1,\dots }
, then
E
[
f
(
x
k
+
1
)
−
f
∗
]
≤
(
1
−
2
η
k
μ
)
[
f
(
x
k
)
−
f
∗
]
+
L
C
η
k
2
2
{\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-2\eta _{k}\mu \right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}}
Similarly, if
∀
k
,
E
i
[
‖
∇
f
i
(
x
k
)
−
∇
f
(
x
k
)
‖
2
]
≤
C
{\displaystyle \forall k,\quad \mathbb {E} _{i}[\|\nabla f_{i}(x_{k})-\nabla f(x_{k})\|^{2}]\leq C}
then
E
[
f
(
x
k
+
1
)
−
f
∗
]
≤
(
1
−
μ
(
2
η
k
−
L
η
k
2
)
)
[
f
(
x
k
)
−
f
∗
]
+
L
C
η
k
2
2
{\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu (2\eta _{k}-L\eta _{k}^{2})\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {LC\eta _{k}^{2}}{2}}}
Learning rate schedules
For constant learning rate schedule, with
η
k
=
η
=
1
/
L
{\textstyle \eta _{k}=\eta =1/L}
, we have
E
[
f
(
x
k
+
1
)
−
f
∗
]
≤
(
1
−
μ
/
L
)
[
f
(
x
k
)
−
f
∗
]
+
C
2
L
{\displaystyle \mathbb {E} \left[f\left(x_{k+1}\right)-f^{*}\right]\leq \left(1-\mu /L\right)\left[f\left(x_{k}\right)-f^{*}\right]+{\frac {C}{2L}}}
By induction, we have
E
[
f
(
x
k
)
−
f
∗
]
≤
(
1
−
μ
/
L
)
k
[
f
(
x
0
)
−
f
∗
]
+
C
2
μ
{\displaystyle \mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]\leq \left(1-\mu /L\right)^{k}\left[f\left(x_{0}\right)-f^{*}\right]+{\frac {C}{2\mu }}}
We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the
C
/
(
2
L
)
{\textstyle C/(2L)}
term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and
x
k
{\textstyle x_{k}}
starts doing a random walk in the vicinity of
X
∗
{\textstyle X^{*}}
.
For decreasing learning rate schedule with
η
k
=
O
(
1
/
k
)
{\textstyle \eta _{k}=O(1/k)}
, we have
E
[
f
(
x
k
)
−
f
∗
]
=
O
(
1
/
k
)
{\displaystyle \mathbb {E} \left[f\left(x_{k}\right)-f^{*}\right]=O(1/k)}
.
References
- Polyak, B. T. (1 January 1963). "Gradient methods for the minimisation of functionals". USSR Computational Mathematics and Mathematical Physics. 3 (4): 864–878. doi:10.1016/0041-5553(63)90382-3. Retrieved 27 December 2025.
- Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS, 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, MR 0972342, S2CID 56006439
- Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society, 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, JSTOR 2153965, MR 1046016
- Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
- Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning. 59: 85–116. arXiv:2003.00307. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203.
External links
- "Lojasiewicz inequality", Encyclopedia of Mathematics, EMS Press, 2001 [1994]