Optimistic planning for Markov decision processes

The reinforcement learning community has recently intensified its interest in online planning methods, largely because their complexity is relatively independent of the size of the state space. However, tight near-optimality guarantees are not yet available for the general case of stochastic Markov decision processes and closed-loop, state-dependent planning policies. We therefore consider an algorithm related to AO* that optimistically explores a tree representation of the space of closed-loop policies, and we analyze the near-optimality of the action it returns after n tree-node expansions. While this optimistic planning requires a finite number of actions and of possible next states for each transition, its asymptotic performance does not depend directly on these numbers, but only on the subset of nodes that significantly impact near-optimal policies. We characterize this set by introducing a novel measure of problem complexity, called the near-optimality exponent. Specializing the exponent and the performance bound for several interesting classes of MDPs shows that the algorithm works better when there are fewer near-optimal policies and when the transition probabilities are less uniform.
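
To make the idea concrete, the following is a minimal Python sketch of an optimistic planning loop in this spirit. It is an illustrative approximation under stated assumptions, not the paper's exact algorithm: it assumes rewards in [0, 1], a known generative model that enumerates (next state, probability, reward) triples, a fixed discount factor, and a budget of n node expansions; the names GAMMA, BUDGET, and toy_model are hypothetical placeholders.

```python
# Sketch of an optimistic planning loop for a discounted MDP (assumptions:
# rewards in [0, 1], finite actions and next states per transition, a known
# generative model, and a budget of n tree-node expansions).

GAMMA = 0.9     # discount factor (assumed)
BUDGET = 200    # number of tree-node expansions n (assumed)


class Node:
    """A tree node: a state reached along one stochastic path from the root."""

    def __init__(self, state, trans_prob, reward, depth, parent=None):
        self.state = state
        self.trans_prob = trans_prob   # P(state | parent state, action) on the incoming edge
        self.reward = reward           # reward collected on the incoming transition
        self.depth = depth
        self.parent = parent
        self.children = {}             # action -> list of child Nodes (stochastic outcomes)
        self.path_prob = trans_prob * (parent.path_prob if parent else 1.0)
        self.upper = 1.0 / (1.0 - GAMMA)   # optimistic bound on the value from this node

    def is_leaf(self):
        return not self.children


def action_upper(node, action):
    """Optimistic value of taking `action` at `node` (expectation over outcomes)."""
    return sum(c.trans_prob * (c.reward + GAMMA * c.upper)
               for c in node.children[action])


def expand(node, model, actions):
    """Create one child per action and per possible next state of that action."""
    for a in actions:
        node.children[a] = [Node(s2, p, r, node.depth + 1, parent=node)
                            for (s2, p, r) in model(node.state, a)]


def backup(node):
    """Propagate the optimistic bounds from the freshly expanded node to the root."""
    while node is not None:
        if not node.is_leaf():
            node.upper = max(action_upper(node, a) for a in node.children)
        node = node.parent


def optimistic_leaves(node):
    """Leaves of the optimistic closed-loop policy: follow the best action at
    every expanded node, but keep all stochastic outcomes of that action."""
    if node.is_leaf():
        return [node]
    best = max(node.children, key=lambda a: action_upper(node, a))
    leaves = []
    for c in node.children[best]:
        leaves.extend(optimistic_leaves(c))
    return leaves


def plan(root_state, model, actions, budget=BUDGET):
    """Expand `budget` nodes, always choosing the leaf of the optimistic policy
    whose discounted path probability contributes most to the remaining
    uncertainty, then return the root action that looks best."""
    root = Node(root_state, trans_prob=1.0, reward=0.0, depth=0)
    for _ in range(budget):
        leaf = max(optimistic_leaves(root),
                   key=lambda l: l.path_prob * GAMMA ** l.depth)
        expand(leaf, model, actions)
        backup(leaf)
    return max(root.children, key=lambda a: action_upper(root, a))


# Purely illustrative two-state, two-action MDP.
def toy_model(state, action):
    if action == "stay":
        return [(state, 1.0, 0.5)]
    other = 1 - state
    return [(other, 0.8, 1.0 if other == 1 else 0.0),   # switch succeeds
            (state, 0.2, 0.0)]                          # switch fails


if __name__ == "__main__":
    print(plan(0, toy_model, ["stay", "switch"]))
```

The sketch expands only leaves of the currently optimistic closed-loop policy, ranked by their discounted path probability, which mirrors the intuition above that performance is driven by the nodes that matter for near-optimal policies rather than by the raw numbers of actions and successor states.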
