Exploration Bonuses and Dual Control

Finding the Bayesian balance between exploration and exploitation in adaptive optimal control is in general intractable. This paper shows how to compute suboptimal estimates based on a certainty-equivalence approximation (Cozzolino, Gonzalez-Zubieta & Miller, 1965) arising from a form of dual control. This systematizes and extends existing uses of exploration bonuses in reinforcement learning (Sutton, 1990). The approach has two components: a statistical model of uncertainty in the world and a way of turning this into exploratory behavior. This general approach is applied to two-dimensional mazes with movable barriers, and its performance is compared with Sutton's DYNA system.
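The two components described in the abstract, a statistical model of the agent's uncertainty and a rule for turning that uncertainty into exploratory behavior, can be illustrated with a minimal sketch. The sketch below assumes a tabular maze, uses visit counts as a stand-in for the uncertainty model, and folds a count-based bonus into Dyna-style planning backups; the class name, the bonus form, and the hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import random
import numpy as np

# Minimal sketch (illustrative, not the paper's exact algorithm): a tabular
# Dyna-style learner whose planning backups add an exploration bonus that
# shrinks as the agent becomes more certain about a state-action pair.
class BonusDyna:
    def __init__(self, n_states, n_actions, gamma=0.95, alpha=0.5,
                 bonus_scale=0.1, planning_steps=10):
        self.Q = np.zeros((n_states, n_actions))
        self.counts = np.zeros((n_states, n_actions))   # visits per (s, a)
        self.model = {}                                  # (s, a) -> (reward, next state)
        self.gamma, self.alpha = gamma, alpha
        self.bonus_scale = bonus_scale
        self.planning_steps = planning_steps

    def bonus(self, s, a):
        # Uncertainty proxy: rarely visited pairs earn a larger bonus.
        return self.bonus_scale / np.sqrt(1.0 + self.counts[s, a])

    def _backup(self, s, a, r, s_next):
        target = r + self.gamma * self.Q[s_next].max()
        self.Q[s, a] += self.alpha * (target - self.Q[s, a])

    def update(self, s, a, r, s_next):
        # Real experience: record the transition and perform one Q backup.
        self.counts[s, a] += 1
        self.model[(s, a)] = (r, s_next)
        self._backup(s, a, r, s_next)
        # Planning: replay remembered transitions with the bonus folded into
        # the reward, so poorly known parts of the maze look worth revisiting.
        known = list(self.model)
        for ps, pa in random.sample(known, min(self.planning_steps, len(known))):
            pr, ps_next = self.model[(ps, pa)]
            self._backup(ps, pa, pr + self.bonus(ps, pa), ps_next)

    def act(self, s):
        # Greedy choice over bonus-augmented values.
        actions = np.arange(self.Q.shape[1])
        return int(np.argmax(self.Q[s] + self.bonus(s, actions)))
```

In the abstract's terms, the counts play the role of the statistical model of uncertainty, and the bonus term is the mechanism that converts that uncertainty into exploratory behavior during planning.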

[1] H. Simon et al. A Behavioral Model of Rational Choice, 1955.

[2] H. Simon et al. Rational choice and the structure of the environment, 1956, Psychological Review.

[3] S. Dreyfus. Dynamic Programming and the Calculus of Variations, 1960.

[4] Ronald A. Howard et al. Dynamic Programming and Markov Processes, 1960.

[5] L. Meier. Combined optimal control and estimation, 1965.

[6] John M. Cozzolino et al. Markovian Decision Processes with Uncertain Transition Probabilities, 1965.

[7] C. Striebel. Sufficient statistics in the optimum control of stochastic systems, 1965.

[8] A. G. Butkovskiy et al. Optimal control of systems, 1966.

[9] R. Rishel. Necessary and Sufficient Dynamic Programming Conditions for Continuous Time Stochastic Optimal Control, 1970.

[10] Yaakov Bar-Shalom et al. An actively adaptive control for linear systems with random parameters via the dual control approach, 1972, CDC 1972.

[11] W. J. Studden et al. Theory of Optimal Experiments, 1972.

[12] Y. Bar-Shalom et al. Wide-sense adaptive dual control for nonlinear stochastic systems, 1973.

[13] M. Athans et al. Some properties of the dual adaptive stochastic control algorithm, 1981.

[14] Mitsuo Sato et al. Learning control of finite Markov chains with unknown transition probabilities, 1982.

[15] Patchigolla Kiran Kumar et al. A Survey of Some Results in Stochastic Adaptive Control, 1985.

[16] A. Barto et al. Learning and Sequential Decision Making, 1989.

[17] Alan D. Christiansen et al. Learning reliable manipulation strategies without initial physical models, 1990, Proceedings, IEEE International Conference on Robotics and Automation.

[18] Richard S. Sutton et al. Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming, 1990, ML.

[19] M. Gabriel et al. Learning and Computational Neuroscience: Foundations of Adaptive Networks, 1990.

[20] Jürgen Schmidhuber. Adaptive confidence and adaptive curiosity, 1991, Forschungsberichte, TU Munich.

[21] Sebastian Thrun et al. Active Exploration in Dynamic Environments, 1991, NIPS.

[22] W. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes, 1991.

[23] Sebastian Thrun et al. The role of exploration in learning control, 1992.

[24] C. Atkeson et al. Prioritized Sweeping: Reinforcement Learning with Less Data and Less Time, 1993, Machine Learning.

[25] David A. Cohn et al. Neural Network Exploration Using Optimal Experiment Design, 1993, NIPS.

[26] Ben J. A. Kröse et al. Learning from delayed rewards, 1995, Robotics Auton. Syst.

[27] Andrew G. Barto et al. Learning to Act Using Real-Time Dynamic Programming, 1995, Artif. Intell.

[28] Michael L. Littman et al. Algorithms for Sequential Decision Making, 1996.
