Memory Efficient Adaptive Optimization

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter, introducing significant memory overhead that restricts both the size of the model being used and the number of examples in a mini-batch. We describe an effective and flexible adaptive optimization method with greatly reduced memory overhead. Our method retains the benefits of per-parameter adaptivity while allowing significantly larger models and batch sizes. We give convergence guarantees for our method, and demonstrate its effectiveness in training very large translation and language models with up to 2-fold speedups compared to the state of the art.
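To make the memory trade-off concrete, the sketch below (NumPy) contrasts a per-parameter second-moment accumulator, as maintained by Adagrad and Adam, with a factored accumulator that persists only per-row and per-column statistics of a weight matrix. The factoring scheme, the function name, and the update recursion here are illustrative assumptions chosen for exposition, not the paper's exact algorithm.

import numpy as np

def factored_second_moment_step(row_acc, col_acc, grad):
    """One update of a memory-reduced second-moment accumulator for an
    (n, m) weight matrix.

    Instead of storing one accumulator entry per parameter (as Adagrad and
    Adam do, costing n*m extra floats), only per-row and per-column
    statistics are kept (n + m floats). The per-entry statistic is recovered
    as the elementwise minimum of the two, which upper-bounds the true
    accumulated sum of squared gradients. This is an illustrative sketch of
    the general idea, not the published method.
    """
    v_hat = np.minimum(row_acc[:, None], col_acc[None, :]) + grad ** 2
    return v_hat, v_hat.max(axis=1), v_hat.max(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m = 1000, 500

    row_acc, col_acc = np.zeros(n), np.zeros(m)
    exact = np.zeros((n, m))  # Adagrad-style per-parameter accumulator, for comparison
    for _ in range(10):
        grad = rng.normal(size=(n, m))
        exact += grad ** 2
        v_hat, row_acc, col_acc = factored_second_moment_step(row_acc, col_acc, grad)

    print("per-parameter accumulator size:", exact.size)                   # 500000
    print("factored accumulator size:     ", row_acc.size + col_acc.size)  # 1500
    print("upper bound holds:", bool(np.all(v_hat >= exact)))              # True

Note that the full (n, m) quantity v_hat appears only as a per-step temporary alongside the gradient itself; only the n + m row and column statistics persist between steps, which is where the memory saving over a per-parameter accumulator comes from.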
