Tractable Optimization in Machine Learning

Machine Learning (ML) broadly encompasses a variety of adaptive, autonomous, and intelligent tasks where one must “learn” to predict from observations and feedback. Throughout its evolution, ML has drawn heavily and successfully on optimization algorithms. This relation to optimization is not surprising, as “learning” and “adapting” usually lead to problems in which some quality function must be optimized. But the interaction between ML and optimization is now undergoing rapid change. The increased size, complexity, and variety seen in ML problems not only prompt a refinement of existing optimization techniques but also spur the development of new methods tuned to the specific needs of ML applications.

In particular, ML applications must usually cope with large-scale data, which forces us to prefer “simpler,” perhaps less accurate but more scalable algorithms. Such methods can also crunch through more data, and may actually be better suited for learning; for a more precise characterization, see [Bottou and Bousquet, 2011]. The use of possibly less accurate methods is also grounded in pragmatic realities: modeling limitations, observational noise, uncertainty, and computational errors are pervasive in real data, so trusting more than a few digits of numerical accuracy would be unrealistic. From an engineering perspective, simpler algorithms translate into more reliable software that is easier to implement, debug, and deploy.

Before we get carried away by these benefits, we must recall a sobering statement of Nesterov [2004]: “in general, optimization problems are unsolvable.” In other words, obtaining globally optimal solutions is in general intractable. Fortunately, nature makes a generous exception for convex optimization, which is not only tractable [Nemirovsky and Yudin, 1983] but also widely applicable [Boyd and Vandenberghe, 2004]. We therefore limit our attention to convex optimization, and therein we focus on algorithms that are simple, scalable, and amenable to theoretical analysis that qualifies their tractability.

The superficiality of a summary such as the one attempted herein is ineluctable. Nevertheless, we hope that it still provides a quick entry into large-scale convex optimization for non-experts, while offering pointers to literature that even more experienced readers might find useful.
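To make the flavor of such “simple, scalable” methods concrete, the sketch below runs stochastic gradient descent, in the spirit of Robbins–Monro stochastic approximation, on a small convex problem (ridge-regularized least squares). Everything in it (the synthetic data, problem sizes, and the diminishing step size) is an illustrative assumption rather than a prescription from the text; the point is only that each iteration touches a single observation and costs O(d) work, which is what makes such methods attractive at scale.

```python
import numpy as np

# Illustrative only: a synthetic ridge-regularized least-squares problem,
#   min_w  (1/n) * sum_i 0.5 * (x_i^T w - y_i)^2  +  (lam/2) * ||w||^2
rng = np.random.default_rng(0)
n, d, lam = 10_000, 50, 1e-3
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for t in range(1, 100_001):
    i = rng.integers(n)                      # draw one random sample per step
    g = (X[i] @ w - y[i]) * X[i] + lam * w   # stochastic gradient of f_i(w)
    w -= 0.01 / np.sqrt(t) * g               # diminishing (Robbins-Monro style) step size

print("relative error:", np.linalg.norm(w - w_true) / np.linalg.norm(w_true))
```

A batch method such as Newton's method would recover far more digits of accuracy per pass over the data, but, echoing the argument above, those extra digits are rarely meaningful given the noise already present in the data, whereas each stochastic step remains cheap enough to scale to very large n.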

[1] Xinhua Zhang, et al. Accelerated training of max-margin Markov networks with kernels. Theor. Comput. Sci., 2014.

[2] Dimitri P. Bertsekas, et al. Incremental Gradient, Subgradient, and Proximal Methods for Convex Optimization: A Survey. arXiv, 2015.

[3] Yurii Nesterov, et al. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM J. Optim., 2012.

[4] Maxim Raginsky, et al. Information-Based Complexity, Feedback and Dynamics in Convex Programming. IEEE Transactions on Information Theory, 2010.

[5] Kaizhu Huang, et al. Sparse Metric Learning via Smooth Optimization. NIPS, 2009.

[6] Zhang Liu, et al. Interior-point methods for large-scale cone programming. 2011.

[7] Yurii Nesterov, et al. Smooth minimization of non-smooth functions. Math. Program., 2005.

[8] Stephen P. Boyd, et al. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Found. Trends Mach. Learn., 2011.

[9] Marc Teboulle, et al. A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM J. Imaging Sci., 2009.

[10] H. Robbins. A Stochastic Approximation Method. 1951.

[11] R. Tibshirani, et al. Sparsity and smoothness via the fused lasso. 2005.

[12] A. Juditsky, et al. First-Order Methods for Nonsmooth Convex Large-Scale Optimization, I: General Purpose Methods. 2010.

[13] Yurii Nesterov, et al. How to advance in Structural Convex Optimization. 2008.

[14] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). 1983.

[15] Peter Richtárik, et al. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. Modeling and Optimization: Theory and Applications, 2013.

[16] Alexander J. Smola, et al. Parallelized Stochastic Gradient Descent. NIPS, 2010.

[17] Samuel Burer, et al. A First-Order Smoothing Technique for a Class of Large-Scale Linear Programs. SIAM J. Optim., 2014.

[18] Julien Mairal, et al. Convex optimization with sparsity-inducing norms. 2011.

[19] J. Borwein, et al. Two-Point Step Size Gradient Methods. 1988.

[20] Elad Hazan, et al. An optimal algorithm for stochastic strongly-convex optimization. arXiv:1006.2425, 2010.

[21] O. Nelles, et al. An Introduction to Optimization. IEEE Antennas and Propagation Magazine, 1996.

[22] Yuri M. Ermoliev. Stochastic Quasigradient Methods: Applications. Encyclopedia of Optimization, 2009.

[23] Patrick L. Combettes, et al. Proximal Splitting Methods in Signal Processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, 2009.

[24] Léon Bottou and Olivier Bousquet. The Tradeoffs of Large Scale Learning. NIPS, 2007.

[25] Alexander Shapiro, et al. Stochastic Approximation Approach to Stochastic Programming. 2013.

[26] L. Rudin, et al. Nonlinear total variation based noise removal algorithms. 1992.

[27] R. Koenker, et al. The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. 1997.

[28] Guanghui Lan. Level methods uniformly optimal for composite and structured nonsmooth convex optimization. 2011.

[29] Michael Patriksson, et al. A Survey on a Classic Core Problem in Operations Research. 2005.

[30] Alexander J. Smola, et al. Bundle Methods for Regularized Risk Minimization. J. Mach. Learn. Res., 2010.

[31] Nicholas I. M. Gould, et al. Trust Region Methods. MOS-SIAM Series on Optimization, 2000.

[32] A. Juditsky. First-Order Methods for Nonsmooth Convex Large-Scale Optimization, II: Utilizing Problem's Structure. 2010.

[33] Masashi Sugiyama, et al. Super-Linear Convergence of Dual Augmented Lagrangian Algorithm for Sparsity Regularized Estimation. J. Mach. Learn. Res., 2009.

[34] Stephen J. Wright, et al. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. NIPS, 2011.

[35] Yurii Nesterov, et al. Primal-dual subgradient methods for convex problems. Math. Program., 2005.

[36] P. Tseng, et al. Block-Coordinate Gradient Descent Method for Linearly Constrained Nonsmooth Separable Optimization. 2009.

[37] V. M. Tikhomirov. The Evolution of Methods of Convex Optimization. 1996.

[38] H. Kushner, et al. Stochastic Approximation and Recursive Algorithms and Applications. 2003.

[39] Patrick L. Combettes, et al. Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Model. Simul., 2005.

[40] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization. 1983.

[41] Shuiwang Ji, et al. SLEP: Sparse Learning with Efficient Projections. 2011.

[42] Yurii Nesterov, et al. Interior-point polynomial algorithms in convex programming. SIAM Studies in Applied Mathematics, 1994.

[43] Jean-Yves Audibert. Optimization for Machine Learning. 1995.

[44] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Applied Optimization, 2004.

[45] Laurent El Ghaoui, et al. Robust Optimization. ICORES, 2021.

[46] A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM J. Optim., 2004.

[47] Inderjit S. Dhillon, et al. A scalable trust-region algorithm with application to mixed-norm regression. ICML, 2010.

[48] Tony F. Chan, et al. A General Framework for a Class of First Order Primal-Dual Algorithms for Convex Optimization in Imaging Science. SIAM J. Imaging Sci., 2010.

[49] Martin J. Wainwright, et al. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control, 2010.

[50] Andreas Krause, et al. Efficient Minimization of Decomposable Submodular Functions. NIPS, 2010.

[51] Katya Scheinberg, et al. Introduction to derivative-free optimization. Math. Comput., 2010.

[52] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[53] Guanghui Lan, et al. Bundle-level type methods uniformly optimal for smooth and nonsmooth convex optimization. Math. Program., 2013.

[54] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. 1996.

[55] Yurii Nesterov, et al. Excessive Gap Technique in Nonsmooth Convex Minimization. SIAM J. Optim., 2005.

[56] Richard G. Baraniuk, et al. Compressive Sensing. Computer Vision, A Reference Guide, 2008.

[57] Y. Nesterov. Gradient methods for minimizing composite objective function. 2007.

[58] Dimitri P. Bertsekas, et al. Nonlinear Programming. 1995.

[59] Saeed Ghadimi, et al. Optimal Stochastic Approximation Algorithms for Strongly Convex Stochastic Composite Optimization I: A Generic Algorithmic Framework. SIAM J. Optim., 2012.

[60] John N. Tsitsiklis, et al. Gradient Convergence in Gradient Methods with Errors. SIAM J. Optim., 1999.

[61] Stephen J. Wright, et al. Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 2008.

[62] Elad Hazan. The convex optimization approach to regret minimization. 2011.