Kalman meets Bellman: Improving Policy Evaluation through Value Tracking

Policy evaluation is a key process in Reinforcement Learning (RL): it assesses a given policy by estimating the corresponding value function. When using parameterized value functions, common approaches minimize the sum of squared Bellman temporal-difference errors and obtain a point estimate of the parameters. Kalman-filter-based and Gaussian-process-based frameworks have been suggested for evaluating the policy by treating the value as a random variable. These frameworks can learn uncertainties over the value parameters and exploit them for policy exploration. When these frameworks are adapted to deep RL tasks, several limitations emerge: excessive computation in each optimization step; difficulty in handling batches of samples, which slows training; and a memory effect in stochastic environments, which prevents off-policy learning. In this work, we discuss these limitations and propose to overcome them with an alternative, general framework based on the extended Kalman filter. We devise an optimization method, called Kalman Optimization for Value Approximation (KOVA), that can be incorporated as a policy-evaluation component in policy-optimization algorithms. KOVA minimizes a regularized objective function that accounts for both parameter uncertainty and the uncertainty of the noisy observed returns. We analyze the properties of KOVA and present its performance on deep RL control tasks.
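
To make the extended-Kalman-filter view of policy evaluation concrete, the sketch below shows a single-sample EKF-style update of value-function parameters, where the Kalman gain trades off the parameter uncertainty against the noise of the observed return. This is a minimal illustrative sketch of the general idea, not the exact KOVA update; the function name, the scalar-observation setup, and the default noise values are assumptions made for the example.

```python
import numpy as np

def ekf_value_update(theta, P, grad_v, v_pred, target,
                     obs_noise=1.0, process_noise=1e-3):
    """One EKF-style update of value-function parameters (illustrative only).

    theta  : (d,)  parameters of the value approximator
    P      : (d,d) parameter covariance (uncertainty over theta)
    grad_v : (d,)  gradient of the value estimate w.r.t. theta at the
                   sampled state (the EKF linearization)
    v_pred : float value predicted by the approximator at that state
    target : float noisy return / Bellman target for that state
    """
    d = len(theta)

    # Predict: inflate the covariance with process noise so the filter
    # keeps tracking a (possibly drifting) value function.
    P = P + process_noise * np.eye(d)

    # Innovation and its covariance: obs_noise models the randomness
    # of the observed return.
    innovation = target - v_pred
    S = grad_v @ P @ grad_v + obs_noise

    # Kalman gain balances parameter uncertainty against return noise.
    K = P @ grad_v / S

    # Correct: move the parameters toward the target and shrink P.
    theta = theta + K * innovation
    P = P - np.outer(K, grad_v @ P)
    return theta, P
```

In a batch setting, the same correction can be written as minimizing a regularized least-squares objective in which the observation-noise covariance weights the Bellman errors and the prior parameter covariance penalizes deviation from the previous estimate; the sketch above is the single-sample special case of that trade-off.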
