Automatic differentiation in machine learning: a survey

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply "autodiff", is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names "dynamic computational graphs" and "differentiable programming". We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
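To make the central idea concrete, the following is a minimal sketch (not taken from the survey) of forward-mode AD using dual numbers in Python. The names Dual, sin, and derivative are illustrative choices, not part of any particular AD library; the point is only that derivatives of an ordinary program are obtained exactly by propagating derivative values alongside primal values, with no symbolic expression manipulation and no finite-difference approximation.

    # Minimal forward-mode AD sketch using dual numbers (illustrative only).
    import math


    class Dual:
        """A value paired with its derivative (a 'dual number')."""

        def __init__(self, value, deriv=0.0):
            self.value = value  # primal value
            self.deriv = deriv  # derivative with respect to the chosen input

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # Product rule: (uv)' = u'v + uv'
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

        __rmul__ = __mul__


    def sin(x):
        # Chain rule for an elementary operation: d/dt sin(x) = cos(x) * dx/dt
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


    def derivative(f, x):
        """Evaluate f and df/dx at x in a single forward pass."""
        out = f(Dual(x, 1.0))  # seed the input's derivative with 1
        return out.value, out.deriv


    # Example: f(x) = x * sin(x) + 2x, written as an ordinary program.
    value, grad = derivative(lambda x: x * sin(x) + 2 * x, 1.5)
    print(value, grad)  # grad equals sin(1.5) + 1.5*cos(1.5) + 2, exact to machine precision

Reverse-mode AD (the generalization of backpropagation discussed in the survey) records the same elementary operations but propagates derivatives from outputs back to inputs, which is the efficient choice when a function has many inputs and few outputs, as in gradient-based training of neural networks.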
