Divide-and-conquer checkpointing for arbitrary programs with no user annotation

Classical reverse-mode automatic differentiation (AD) imposes only a small constant-factor overhead in operation count over the original computation, but has storage requirements that grow, in the worst case, in proportion to the time consumed by the original computation. This storage blowup can be ameliorated by checkpointing, a process that reorders the application of classical reverse-mode AD over an execution interval to trade space for time. Applying checkpointing in a divide-and-conquer fashion to strategically chosen nested execution intervals breaks classical reverse-mode AD into stages and can reduce the worst-case growth in storage from linear to sublinear. Doing this has been fully automated only for computations of particularly simple form, with checkpoints spanning execution intervals that arise from a limited set of program constructs. Here we show how the technique can be automated for arbitrary computations. The essential innovation is to apply the technique at the level of the language implementation itself, thus allowing checkpoints to span any execution interval.
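To make the divide-and-conquer idea concrete, the sketch below illustrates it for the simple stepwise setting in which full automation was previously available: a computation given as an explicit sequence of steps. This is an illustration only, not the paper's implementation (which operates inside the language implementation and needs no such stepwise structure); the names step, vjp, forward, and reverse are hypothetical.

```python
# A minimal sketch of divide-and-conquer (recursive bisection) checkpointing,
# assuming the computation is given as a sequence of steps.
#   step(i, x)       -- advances the primal state x across step i
#   vjp(i, x, xbar)  -- maps the adjoint of step i's output to the adjoint
#                       of its input, given the input state x

def forward(step, x, lo, hi):
    """Run the primal from the state x at step lo up to (but excluding) step hi."""
    for i in range(lo, hi):
        x = step(i, x)
    return x

def reverse(step, vjp, x_lo, lo, hi, xbar):
    """Propagate the adjoint xbar (of the state at step hi) back to step lo,
    given only a snapshot x_lo of the state at step lo.  The interval is
    bisected recursively: the primal is re-run to the midpoint, the second
    half is reversed first, then the first half."""
    n = hi - lo
    if n == 0:
        return xbar
    if n == 1:
        return vjp(lo, x_lo, xbar)
    mid = lo + n // 2
    x_mid = forward(step, x_lo, lo, mid)             # recompute primal to midpoint
    xbar = reverse(step, vjp, x_mid, mid, hi, xbar)  # reverse the second half
    return reverse(step, vjp, x_lo, lo, mid, xbar)   # then the first half
```

For a computation of n steps this keeps only O(log n) snapshots live at any time while re-running the primal O(log n) times. The contribution of the paper is to obtain this kind of behavior for arbitrary programs, without requiring that they be phrased as such a step sequence and without user annotation.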
