A quasi-Newton approach to non-smooth convex optimization

We extend the well-known BFGS quasi-Newton method and its limited-memory variant LBFGS to the optimization of nonsmooth convex objectives. This is done in a rigorous fashion by generalizing three components of BFGS to subdifferentials: the local quadratic model, the identification of a descent direction, and the Wolfe line search conditions. We prove that, under some technical conditions, the resulting subBFGS algorithm is globally convergent in objective function value. We apply its limited-memory variant (subLBFGS) to L2-regularized risk minimization with the binary hinge loss. To extend our algorithm to the multiclass and multilabel settings, we develop a new, efficient, exact line search algorithm. We prove its worst-case time complexity bounds, and show that our line search can also be used to extend a recently developed bundle method to the multiclass and multilabel settings. We also apply the direction-finding component of our algorithm to L1-regularized risk minimization with the logistic loss. In all these contexts our methods perform comparably to or better than specialized state-of-the-art solvers on a number of publicly available data sets. An open-source implementation of our algorithms is freely available.
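
The direction-finding idea summarized above can be sketched in a few lines of code. The sketch below is a simplified, illustrative reconstruction, not the paper's exact algorithm: it iteratively mixes subgradients so as to shrink their B-weighted norm until the resulting quasi-Newton direction descends against every element of the subdifferential. The names `subbfgs_direction` and `sup_subgrad` are our own, and the oracle returning a maximizing subgradient along a trial direction is assumed to be available for the loss at hand.

```python
import numpy as np

def subbfgs_direction(sup_subgrad, g0, B, max_iter=50, eps=1e-8):
    """Illustrative sketch of quasi-Newton direction finding at a nonsmooth point.

    sup_subgrad(p): returns an element of argmax_{g in df(x)} g.p,
                    i.e. the worst-case subgradient along a trial direction p.
    g0:             an arbitrary initial subgradient at the current iterate.
    B:              positive-definite inverse-Hessian approximation from (L)BFGS.
    """
    g = np.asarray(g0, dtype=float)
    for _ in range(max_iter):
        p = -B @ g                    # trial direction from the local quadratic model
        g_bar = sup_subgrad(p)        # subgradient that most opposes p
        if g_bar @ p <= -eps:         # p descends against every subgradient: done
            return p
        # Mix g toward g_bar to shrink the B-weighted norm of g, minimizing
        # q(eta) = 0.5 * ((1-eta)*g + eta*g_bar)^T B ((1-eta)*g + eta*g_bar)
        d = g_bar - g
        dBd = d @ B @ d
        eta = 1.0 if dBd <= 0 else np.clip(-(g @ B @ d) / dBd, 0.0, 1.0)
        g = g + eta * d
    return -B @ g                     # a near-zero p signals approximate optimality

# Example: f(x) = ||x||_1 at x = (2, 0, 0), where the subdifferential
# is {1} x [-1, 1] x [-1, 1]; a maximizing subgradient picks sign(p_i)
# in each coordinate where f has a kink.
def sup_subgrad(p):
    return np.array([1.0, np.sign(p[1]), np.sign(p[2])])

p = subbfgs_direction(sup_subgrad, g0=np.array([1.0, 1.0, -1.0]), B=np.eye(3))
# p -> (-1, 0, 0): a direction along which f decreases despite the kinks.
```

In this simplified form the mixing step has a closed-form optimal coefficient because the model is quadratic in eta; the paper's actual method additionally generalizes the Wolfe conditions to subdifferentials for the subsequent line search.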
