Near-Optimal Reinforcement Learning in Polynomial Time

We present new algorithms for reinforcement learning and prove that they have polynomial bounds on the resources required to achieve near-optimal return in general Markov decision processes. After observing that the number of actions required to approach the optimal return is lower bounded by the mixing time T of the optimal policy (in the undiscounted case) or by the horizon time T (in the discounted case), we give algorithms whose number of actions and total computation time are only polynomial in T and in the number of states and actions, in both the undiscounted and discounted cases. An interesting aspect of our algorithms is their explicit handling of the exploration-exploitation trade-off.
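The abstract does not spell out the mechanism behind this explicit handling. As a rough illustration only, below is a minimal Python sketch in the spirit of a known-state ("explore-or-exploit") scheme: a state visited often enough under every action is declared known and handled by planning in the empirical model, while unknown states are explored by balanced wandering. All names and constants here (`M_KNOWN`, `HORIZON`, `choose_action`, the toy MDP sizes) are hypothetical, not taken from the paper.

```python
import numpy as np

# Toy problem sizes and thresholds -- illustrative values, not from the paper.
N_STATES, N_ACTIONS = 5, 2
M_KNOWN = 50    # visits per (state, action) before a state counts as "known"
HORIZON = 20    # planning horizon T

# Empirical model: transition counts and accumulated rewards.
counts = np.zeros((N_STATES, N_ACTIONS, N_STATES))
rewards = np.zeros((N_STATES, N_ACTIONS))


def update(s, a, r, s_next):
    """Record one observed transition in the empirical model."""
    counts[s, a, s_next] += 1
    rewards[s, a] += r


def is_known(s):
    # A state is "known" once every action has been tried enough times
    # that the empirical model of it is accurate with high probability.
    return counts[s].sum(axis=1).min() >= M_KNOWN


def choose_action(s):
    if not is_known(s):
        # Balanced wandering: take the least-tried action so the state
        # becomes known as quickly as possible.
        return int(counts[s].sum(axis=1).argmin())

    # On known states, plan in the empirical model by finite-horizon
    # value iteration (exploitation branch only; see the note below).
    n_sa = counts.sum(axis=2)                    # (S, A) visit counts
    P = counts / np.maximum(n_sa[..., None], 1)  # empirical transitions
    R = rewards / np.maximum(n_sa, 1)            # empirical mean rewards
    V = np.zeros(N_STATES)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(HORIZON):
        Q = R + P @ V          # (S, A): one-step Bellman backup
        V = Q.max(axis=1)
    return int(Q[s].argmax())
```

In a full algorithm of this kind, the planner would also solve a second, exploration-oriented problem (maximizing the probability of escaping the set of known states) and would exploit only when the exploit-value is provably near-optimal; the sketch above shows only the exploitation branch.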
