A review of optimistic planning in Markov decision processes

We review a class of online planning algorithms for deterministic and stochastic optimal control problems, modeled as Markov decision processes. At each discrete time step, these algorithms maximize the predicted value of planning policies from the current state and apply the first action of the best policy found. The result is an overall receding-horizon algorithm, which can also be seen as a type of model-predictive control. The space of planning policies is explored optimistically, focusing on the areas with the largest upper bounds on the value (or upper confidence bounds, in the stochastic case). The resulting optimistic planning framework integrates several types of optimism previously used in planning, optimization, and reinforcement learning, yielding intuitive algorithms with good performance guarantees. We describe three such recent algorithms in detail, outline the theoretical guarantees on their performance, and illustrate their behavior in a numerical example.
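To make the receding-horizon loop concrete, the following Python sketch shows one planning step in the spirit of optimistic planning for deterministic systems: a tree of action sequences is grown by always expanding the leaf with the largest upper bound on its value, and the first action of the best sequence found is returned. This is a minimal illustration, not the paper's implementation; the model interface `step(state, action) -> (next_state, reward)`, the names `actions` and `budget`, and the assumption that rewards lie in [0, 1] are all illustrative choices.

```python
from math import inf

def optimistic_plan(state, step, actions, gamma=0.95, budget=100):
    """Return the first action of the most promising plan from `state` (illustrative OPD-style sketch)."""
    # Each leaf of the planning tree: (discounted return of the path, depth, reached state, first action).
    leaves = [(0.0, 0, state, None)]
    best_value, best_action = -inf, None

    def b_value(leaf):
        # Optimistic upper bound: reward collected along the path plus the largest
        # possible discounted return of any continuation, gamma^depth / (1 - gamma),
        # valid because per-step rewards are assumed to lie in [0, 1].
        ret, depth, _, _ = leaf
        return ret + gamma ** depth / (1.0 - gamma)

    for _ in range(budget):
        # Optimism: expand the leaf with the largest upper bound on its value.
        idx = max(range(len(leaves)), key=lambda i: b_value(leaves[i]))
        ret, depth, s, first = leaves.pop(idx)

        for a in actions:
            next_s, r = step(s, a)                       # simulate the known model
            child = (ret + gamma ** depth * r, depth + 1, next_s,
                     a if first is None else first)      # remember the first action on the path
            leaves.append(child)
            if child[0] > best_value:                    # track the best lower bound (return so far)
                best_value, best_action = child[0], child[3]

    return best_action
```

In a receding-horizon controller, this routine would be called at every time step on the current state, the returned action applied to the system, and the plan recomputed from the next state, exactly as in model-predictive control.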
