Model-Value Inconsistency as a Signal for Epistemic Uncertainty

Using a model of the environment and a value function, an agent can construct many estimates of a state's value by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an implicit value ensemble (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal model-value inconsistency, or self-inconsistency for short. Unlike prior work, which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function that are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence, in both tabular settings and function-approximation settings from pixels, that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shifts, and (iii) for robustifying value-based planning with a learned model.
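To make the construction concrete, the following minimal Python sketch forms one k-step value estimate per unroll length and uses the spread of those estimates as the self-inconsistency signal. The interfaces `model.step(s, a) -> (next_state, reward)`, `value_fn(s)`, and `policy(s)`, as well as the choice of the standard deviation as the discrepancy measure, are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def implicit_value_ensemble(model, value_fn, policy, state, horizons, gamma=0.99):
    """Return one k-step value estimate per horizon k in `horizons`.

    Each estimate is obtained by unrolling the learned model for k steps,
    accumulating discounted predicted rewards, and bootstrapping the tail
    with the learned value function.
    """
    estimates = []
    for k in horizons:
        s, ret, discount = state, 0.0, 1.0
        for _ in range(k):                 # unroll the model for k steps
            a = policy(s)
            s, r = model.step(s, a)
            ret += discount * r
            discount *= gamma
        ret += discount * value_fn(s)      # bootstrap with the value function
        estimates.append(ret)
    return np.array(estimates)

def self_inconsistency(estimates):
    # Disagreement within the implicit value ensemble, used as a proxy
    # for epistemic uncertainty (here measured by the standard deviation).
    return np.std(estimates)
```

The essential ingredients are only a set of unroll lengths and the single model and value function that most model-based agents already learn; no additional ensemble members need to be trained.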
