No More Pesky Learning Rates: Supplementary Material

If we do gradient descent with $\eta^*(t)$, then, almost surely, the algorithm converges (for the quadratic model). To prove this, we follow classical techniques based on Lyapunov stability theory (Bucy, 1965). Notice that the expected loss follows
\begin{align*}
\mathbb{E}\!\left[J\!\left(\theta^{(t+1)}\right) \,\middle|\, \theta^{(t)}\right]
&= \frac{1}{2} h \cdot \mathbb{E}\!\left[\left((1-\eta^* h)\!\left(\theta^{(t)}-\theta^*\right) + \eta^* h \sigma \xi\right)^2 + \sigma^2\right] \\
&= \frac{1}{2} h \left[(1-\eta^* h)^2 \left(\theta^{(t)}-\theta^*\right)^2 + (\eta^*)^2 h^2 \sigma^2 + \sigma^2\right] \\
&= \frac{1}{2} h \left[\frac{\sigma^2\left(\theta^{(t)}-\theta^*\right)^2}{\left(\theta^{(t)}-\theta^*\right)^2 + \sigma^2} + \sigma^2\right] \\
&\leq J\!\left(\theta^{(t)}\right),
\end{align*}
where the second equality uses $\mathbb{E}[\xi] = 0$ and $\mathbb{E}[\xi^2] = 1$, and the third substitutes the definition of $\eta^*(t)$. Thus $J(\theta^{(t)})$ is a positive super-martingale, indicating that almost surely $J(\theta^{(t)}) \rightarrow J_{\infty}$. It remains to prove that almost surely $J_{\infty} = J(\theta^*) = \frac{1}{2} h \sigma^2$. Observe that
\begin{align*}
J\!\left(\theta^{(t)}\right) - \mathbb{E}\!\left[J\!\left(\theta^{(t+1)}\right) \,\middle|\, \theta^{(t)}\right] &= \frac{1}{2} h^2\, \eta^*(t) \left(\theta^{(t)} - \theta^*\right)^2, \\
\mathbb{E}\!\left[J\!\left(\theta^{(t)}\right)\right] - \mathbb{E}\!\left[J\!\left(\theta^{(t+1)}\right)\right] &= \frac{1}{2} h^2\, \mathbb{E}\!\left[\eta^*(t) \left(\theta^{(t)} - \theta^*\right)^2\right].
\end{align*}
Since $\mathbb{E}[J(\theta^{(t)})]$ is bounded below by 0 and each term of the telescoping sum is non-negative, the sum converges, which gives $\mathbb{E}[\eta^*(t)(\theta^{(t)} - \theta^*)^2] \rightarrow 0$. Because $h \sigma^2 \eta^*(t) \leq (\theta^{(t)} - \theta^*)^2$, this also gives $\mathbb{E}[(\eta^*(t))^2] \rightarrow 0$, which in turn implies that $\eta^*(t) \rightarrow 0$ in probability. We can rewrite this as
\[
h\, \eta^*(t) = \frac{J(\theta^{(t)}) - \frac{1}{2} h \sigma^2}{J(\theta^{(t)})} \;\rightarrow\; 0 \quad \text{(in probability)}.
\]
By uniqueness of the limit, almost surely, $\left(J_{\infty} - \frac{1}{2} h \sigma^2\right) / J_{\infty} = 0$; since $J$ is bounded below by $\frac{1}{2} h \sigma^2 > 0$, we conclude that $J_{\infty} = \frac{1}{2} h \sigma^2 = J(\theta^*)$ almost surely.
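For completeness, the simplification behind the third equality of the expected-loss chain and the per-step decrement used afterwards can be spelled out. The short derivation below assumes the one-dimensional optimal rate takes the form $\eta^*(t) = \frac{1}{h}\,\frac{(\theta^{(t)}-\theta^*)^2}{(\theta^{(t)}-\theta^*)^2+\sigma^2}$, which is the form consistent with the equalities above; we abbreviate $d = \theta^{(t)} - \theta^*$.
\begin{align*}
% Worked algebra, assuming \eta^*(t) = \frac{1}{h}\,\frac{d^2}{d^2+\sigma^2} with d = \theta^{(t)}-\theta^*.
1 - \eta^* h &= \frac{\sigma^2}{d^2 + \sigma^2},
\qquad \eta^* h = \frac{d^2}{d^2 + \sigma^2}, \\
(1 - \eta^* h)^2 d^2 + (\eta^* h)^2 \sigma^2
  &= \frac{\sigma^4 d^2 + \sigma^2 d^4}{(d^2 + \sigma^2)^2}
   = \frac{\sigma^2 d^2}{d^2 + \sigma^2}, \\
J\!\left(\theta^{(t)}\right) - \mathbb{E}\!\left[J\!\left(\theta^{(t+1)}\right) \,\middle|\, \theta^{(t)}\right]
  &= \frac{h}{2}\left(d^2 - \frac{\sigma^2 d^2}{d^2 + \sigma^2}\right)
   = \frac{h}{2}\,\frac{d^4}{d^2 + \sigma^2}
   = \frac{h^2}{2}\,\eta^*(t)\, d^2 .
\end{align*}
In particular, the expected per-step decrease is strictly positive unless $\theta^{(t)} = \theta^*$, which is what drives the super-martingale argument above.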