Automatic differentiation in machine learning: a survey

Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD), also called algorithmic differentiation or simply "autodiff", is a family of techniques similar to but more general than backpropagation for efficiently and accurately evaluating derivatives of numeric functions expressed as computer programs. AD is a small but established field with applications in areas including computational fluid dynamics, atmospheric sciences, and engineering design optimization. Until very recently, the fields of machine learning and AD have largely been unaware of each other and, in some cases, have independently discovered each other's results. Despite its relevance, general-purpose AD has been missing from the machine learning toolbox, a situation slowly changing with its ongoing adoption under the names "dynamic computational graphs" and "differentiable programming". We survey the intersection of AD and machine learning, cover applications where AD has direct relevance, and address the main implementation techniques. By precisely defining the main differentiation techniques and their interrelationships, we aim to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
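To make the central idea concrete, the following is a minimal sketch (not taken from the survey) of forward-mode AD using dual numbers in Python. The names Dual, sin, and derivative are illustrative choices, not part of any particular AD library; the point is only that derivatives of an ordinary program are obtained exactly by propagating derivative values alongside primal values, with no symbolic expression manipulation and no finite-difference approximation.

    # Minimal forward-mode AD sketch using dual numbers (illustrative only).
    import math


    class Dual:
        """A value paired with its derivative (a 'dual number')."""

        def __init__(self, value, deriv=0.0):
            self.value = value  # primal value
            self.deriv = deriv  # derivative with respect to the chosen input

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.deriv + other.deriv)

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # Product rule: (uv)' = u'v + uv'
            return Dual(self.value * other.value,
                        self.deriv * other.value + self.value * other.deriv)

        __rmul__ = __mul__


    def sin(x):
        # Chain rule for an elementary operation: d/dt sin(x) = cos(x) * dx/dt
        return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


    def derivative(f, x):
        """Evaluate f and df/dx at x in a single forward pass."""
        out = f(Dual(x, 1.0))  # seed the input's derivative with 1
        return out.value, out.deriv


    # Example: f(x) = x * sin(x) + 2x, written as an ordinary program.
    value, grad = derivative(lambda x: x * sin(x) + 2 * x, 1.5)
    print(value, grad)  # grad equals sin(1.5) + 1.5*cos(1.5) + 2, exact to machine precision

Reverse-mode AD (the generalization of backpropagation discussed in the survey) records the same elementary operations but propagates derivatives from outputs back to inputs, which is the efficient choice when a function has many inputs and few outputs, as in gradient-based training of neural networks.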
