Shampoo: Preconditioned Stochastic Tensor Optimization

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
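As a rough illustration of the per-dimension preconditioning described above, here is a minimal NumPy sketch of Shampoo's matrix (order-2 tensor) special case: a left statistic built from G Gᵀ and a right statistic built from Gᵀ G precondition the gradient via their inverse fourth roots. The function name, learning rate, and epsilon regularizer are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def shampoo_matrix_step(W, G, L, R, lr=1.0, eps=1e-4):
    """One Shampoo update for a matrix-shaped parameter W (m x n).

    L (m x m) and R (n x n) are running preconditioner statistics for the
    row and column dimensions, respectively. Names and the eps regularizer
    are illustrative choices for this sketch.
    """
    L += G @ G.T          # accumulate left (row-dimension) statistics
    R += G.T @ G          # accumulate right (column-dimension) statistics

    def inv_fourth_root(M):
        # Inverse fourth root of a symmetric PSD matrix via eigendecomposition.
        vals, vecs = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
        return vecs @ np.diag(vals ** -0.25) @ vecs.T

    # Precondition each dimension of the gradient and take a step.
    W -= lr * inv_fourth_root(L) @ G @ inv_fourth_root(R)
    return W, L, R
```

For a general order-k tensor gradient, the same idea applies with one statistic per dimension, each formed by contracting the gradient with itself over the remaining dimensions.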
