Ergodic mirror descent

We generalize stochastic subgradient methods to settings in which we do not receive independent samples from the distribution over which we optimize, but instead receive samples that are coupled over time. We show that as long as the source of randomness is suitably ergodic (it converges quickly enough to a stationary distribution), the method enjoys strong convergence guarantees, both in expectation and with high probability. This result has implications for high-dimensional stochastic optimization, peer-to-peer distributed optimization schemes, and stochastic optimization problems over combinatorial spaces.
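To make the setting concrete, here is a minimal, hypothetical sketch (not the paper's algorithm or experiments): a one-dimensional stochastic mirror descent run where the data stream is a two-state Markov chain rather than i.i.d. samples. With the squared-Euclidean mirror map, the mirror-descent update reduces to a projected subgradient step; the constraint interval, the chain, the `targets` data, and all function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ergodic noise source: a two-state Markov chain (samples are coupled over
# time, not i.i.d.). Its stationary distribution is uniform over the states.
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
targets = np.array([-1.0, 3.0])   # per-state data; stationary mean is 1.0

# Objective: f(x) = E_pi[ 0.5 * (x - target(state))^2 ], minimized at x* = 1.0.
def subgrad(x, state):
    return x - targets[state]

def mirror_step(x, g, eta, lo=-5.0, hi=5.0):
    # With the squared-Euclidean mirror map, the mirror-descent update is
    # a plain subgradient step followed by projection onto [lo, hi].
    return np.clip(x - eta * g, lo, hi)

x, state, T = 0.0, 0, 200_000
running_sum = 0.0
for t in range(T):
    eta = 1.0 / np.sqrt(t + 1)          # standard 1/sqrt(t) step sizes
    x = mirror_step(x, subgrad(x, state), eta)
    running_sum += x
    state = rng.choice(2, p=P[state])   # next sample depends on the last one

x_bar = running_sum / T                 # averaged iterate
print(x_bar)
```

Despite the dependence between consecutive samples, the chain mixes, so the averaged iterate `x_bar` converges toward the stationary-distribution minimizer `x* = 1.0`; this is the phenomenon the ergodicity assumption in the abstract is meant to capture.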
