Learning to learn by gradient descent by gradient descent

The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
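
To make the idea concrete, below is a minimal sketch (not the authors' released code) of casting optimizer design as a learning problem: a small coordinatewise LSTM "optimizer" proposes parameter updates from the optimizee's gradients, and its own weights are trained by backpropagating the summed optimizee loss through the unrolled updates. PyTorch is assumed, the names (`LSTMOptimizer`, `unroll`) are illustrative, and details from the paper such as gradient preprocessing are omitted.

```python
# Sketch of a learned LSTM optimizer: theta_{t+1} = theta_t + m(grad_t, h_t; phi),
# with phi trained to minimize the summed optimizee loss over the unrolled trajectory.
import torch
import torch.nn as nn

class LSTMOptimizer(nn.Module):
    """Coordinatewise LSTM: each optimizee parameter is one batch element,
    so the same small network is shared across all coordinates."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, grad, state):
        # grad: (num_params, 1), one row per optimizee coordinate
        h, c = self.cell(grad, state)
        return self.out(h), (h, c)

def unroll(opt_net, loss_fn, theta0, steps=20):
    """Apply `steps` learned updates; the returned summed loss is
    differentiable w.r.t. the optimizer's parameters phi."""
    theta = theta0
    n, hs = theta.numel(), opt_net.cell.hidden_size
    state = (torch.zeros(n, hs), torch.zeros(n, hs))
    total_loss = 0.0
    for _ in range(steps):
        loss = loss_fn(theta)
        total_loss = total_loss + loss
        # Detach the gradient fed to the LSTM (second derivatives are dropped,
        # as in the paper); retain the graph so total_loss can still backprop.
        grad, = torch.autograd.grad(loss, theta, retain_graph=True)
        update, state = opt_net(grad.detach().view(-1, 1), state)
        theta = theta + update.view_as(theta)
    return total_loss

# Train phi on random quadratics f(theta) = ||W theta - y||^2 as a stand-in
# for a class of problems, then the learned optimizer can be reused on new
# instances drawn from the same class.
opt_net = LSTMOptimizer()
meta_opt = torch.optim.Adam(opt_net.parameters(), lr=1e-3)
for _ in range(100):
    W, y = torch.randn(10, 10), torch.randn(10)
    theta0 = torch.randn(10, requires_grad=True)
    meta_loss = unroll(opt_net, lambda t: ((W @ t - y) ** 2).sum(), theta0)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```

The design choice mirrored here is the coordinatewise architecture: operating on each parameter independently keeps the learned optimizer small and lets it scale to optimizees of any size.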
