论文信息 - A control-theoretic perspective on optimal high-order optimization

A control-theoretic perspective on optimal high-order optimization

We provide a control-theoretic perspective on optimal tensor algorithms for minimizing a convex function in a finite-dimensional Euclidean space. Given a function $$\varPhi : {\mathbb {R}}^d \rightarrow {\mathbb {R}}$$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>Φ</mml:mi> <mml:mo>:</mml:mo> <mml:msup> <mml:mrow> <mml:mi>R</mml:mi> </mml:mrow> <mml:mi>d</mml:mi> </mml:msup> <mml:mo>→</mml:mo> <mml:mi>R</mml:mi> </mml:mrow> </mml:math> that is convex and twice continuously differentiable, we study a closed-loop control system that is governed by the operators $$\nabla \varPhi $$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>∇</mml:mi> <mml:mi>Φ</mml:mi> </mml:mrow> </mml:math> and $$\nabla ^2 \varPhi $$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:msup> <mml:mi>∇</mml:mi> <mml:mn>2</mml:mn> </mml:msup> <mml:mi>Φ</mml:mi> </mml:mrow> </mml:math> together with a feedback control law $$\lambda (\cdot )$$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>λ</mml:mi> <mml:mo>(</mml:mo> <mml:mo>·</mml:mo> <mml:mo>)</mml:mo> </mml:mrow> </mml:math> satisfying the algebraic equation $$(\lambda (t))^p\Vert \nabla \varPhi (x(t))\Vert ^{p-1} = \theta $$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:msup> <mml:mrow> <mml:mo>(</mml:mo> <mml:mi>λ</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mi>t</mml:mi> <mml:mo>)</mml:mo> </mml:mrow> <mml:mo>)</mml:mo> </mml:mrow> <mml:mi>p</mml:mi> </mml:msup> <mml:msup> <mml:mrow> <mml:mo>‖</mml:mo> <mml:mi>∇</mml:mi> <mml:mi>Φ</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mi>x</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mi>t</mml:mi> <mml:mo>)</mml:mo> </mml:mrow> <mml:mo>)</mml:mo> </mml:mrow> <mml:mo>‖</mml:mo> </mml:mrow> <mml:mrow> <mml:mi>p</mml:mi> <mml:mo>-</mml:mo> <mml:mn>1</mml:mn> </mml:mrow> </mml:msup> <mml:mo>=</mml:mo> <mml:mi>θ</mml:mi> </mml:mrow> </mml:math> for some $$\theta \in (0, 1)$$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>θ</mml:mi> <mml:mo>∈</mml:mo> <mml:mo>(</mml:mo> <mml:mn>0</mml:mn> <mml:mo>,</mml:mo> <mml:mn>1</mml:mn> <mml:mo>)</mml:mo> </mml:mrow> </mml:math>. Our first contribution is to prove the existence and uniqueness of a local solution to this system via the Banach fixed-point theorem. We present a simple yet nontrivial Lyapunov function that allows us to establish the existence and uniqueness of a global solution under certain regularity conditions and analyze the convergence properties of trajectories. The rate of convergence is $$O(1/t^{(3p+1)/2})$$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>O</mml:mi> <mml:mo>(</mml:mo> <mml:mn>1</mml:mn> <mml:mo>/</mml:mo> <mml:msup> <mml:mi>t</mml:mi> <mml:mrow> <mml:mo>(</mml:mo> <mml:mn>3</mml:mn> <mml:mi>p</mml:mi> <mml:mo>+</mml:mo> <mml:mn>1</mml:mn> <mml:mo>)</mml:mo> <mml:mo>/</mml:mo> <mml:mn>2</mml:mn> </mml:mrow> </mml:msup> <mml:mo>)</mml:mo> </mml:mrow> </mml:math> in terms of objective function gap and $$O(1/t^{3p})$$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>O</mml:mi> <mml:mo>(</mml:mo> <mml:mn>1</mml:mn> <mml:mo>/</mml:mo> <mml:msup> <mml:mi>t</mml:mi> <mml:mrow> <mml:mn>3</mml:mn> <mml:mi>p</mml:mi> </mml:mrow> </mml:msup> <mml:mo>)</mml:mo> </mml:mrow> </mml:math> in terms of squared gradient norm. Our second contribution is to provide two algorithmic frameworks obtained from discretization of our continuous-time system, one of which generalizes the large-step A-HPE framework of Monteiro and Svaiter (SIAM J Optim 23(2):1092–1125, 2013) and the other of which leads to a new optimal p-th order tensor algorithm. While our discrete-time analysis can be seen as a simplification and generalization of Monteiro and Svaiter (2013), it is largely motivated by the aforementioned continuous-time analysis, demonstrating the fundamental role that the feedback control plays in optimal acceleration and the clear advantage that the continuous-time perspective brings to algorithmic design. A highlight of our analysis is that we show that all of the p-th order optimal tensor algorithms that we discuss minimize the squared gradient norm at a rate of $$O(k^{-3p})$$<mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML"> <mml:mrow> <mml:mi>O</mml:mi> <mml:mo>(</mml:mo> <mml:msup> <mml:mi>k</mml:mi> <mml:mrow> <mml:mo>-</mml:mo> <mml:mn>3</mml:mn> <mml:mi>p</mml:mi> </mml:mrow> </mml:msup> <mml:mo>)</mml:mo> </mml:mrow> </mml:math>, which complements the recent analysis in Gasnikov et al. (in: COLT, PMLR, pp 1374–1391, 2019), Jiang et al. (in: COLT, PMLR, pp 1799–1801, 2019) and Bubeck et al. (in: COLT, PMLR, pp 492–507, 2019).

Michael I. Jordan | Michael. I. Jordan | Tianyi Lin | Tianyi Lin