Tree-Based Batch Mode Reinforcement Learning

Reinforcement learning aims to determine an optimal control policy from interaction with a system or from observations gathered from a system. In batch mode, this can be achieved by approximating the so-called Q-function from a set of four-tuples (x_t, u_t, r_t, x_{t+1}), where x_t denotes the system state at time t, u_t the control action taken, r_t the instantaneous reward obtained, and x_{t+1} the successor state of the system, and then deriving the control policy from this Q-function. The Q-function approximation may be obtained as the limit of a sequence of (batch mode) supervised learning problems. Within this framework we describe the use of several classical tree-based supervised learning methods (CART, Kd-tree, tree bagging) and two newly proposed ensemble algorithms, namely extremely randomized trees and totally randomized trees. We evaluate their performance on several examples and find that the ensemble methods based on regression trees perform well in extracting relevant information about the optimal control policy from sets of four-tuples. In particular, totally randomized trees give good results while ensuring the convergence of the sequence, whereas extremely randomized trees, which relax the convergence constraint, achieve even better accuracy.
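As an illustration of the sequence of supervised learning problems described above, here is a minimal Python sketch of a fitted Q iteration loop on top of a small ensemble of totally randomized regression trees. It is not the authors' implementation: all function names (build_tree, fit_ensemble, fitted_q_iteration), the toy 11-state chain problem, the discount factor and the number of trees are assumptions chosen for the example. The sketch also reuses the same random seed at every iteration so that the tree structures, which do not depend on the outputs, stay identical from one iteration to the next; that is a design choice of this sketch.

```python
import random

def build_tree(X, y, rng):
    """Grow one totally randomized regression tree (hypothetical helper)."""
    dims = [d for d in range(len(X[0]))
            if min(x[d] for x in X) < max(x[d] for x in X)]
    if not dims:                          # single sample or identical inputs: leaf
        return sum(y) / len(y)
    d = rng.choice(dims)                  # split attribute drawn at random
    lo, hi = min(x[d] for x in X), max(x[d] for x in X)
    cut = rng.uniform(lo, hi)             # cut-point drawn at random, ignoring y
    left = [i for i, x in enumerate(X) if x[d] < cut]
    right = [i for i, x in enumerate(X) if x[d] >= cut]
    if not left or not right:             # degenerate draw: stop with a leaf
        return sum(y) / len(y)
    return (d, cut,
            build_tree([X[i] for i in left], [y[i] for i in left], rng),
            build_tree([X[i] for i in right], [y[i] for i in right], rng))

def predict_tree(node, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(node, tuple):
        d, cut, left, right = node
        node = left if x[d] < cut else right
    return node

def fit_ensemble(X, y, n_trees, seed):
    rng = random.Random(seed)             # same seed -> same tree structures
    return [build_tree(X, y, rng) for _ in range(n_trees)]

def predict(trees, x):
    """Ensemble prediction: average of the individual tree predictions."""
    return sum(predict_tree(t, x) for t in trees) / len(trees)

def fitted_q_iteration(four_tuples, actions, gamma, n_iter, n_trees=10, seed=0):
    """Each iteration solves one regression problem on (x_t, u_t) -> target."""
    X = [(x, u) for x, u, _, _ in four_tuples]
    q = None
    for _ in range(n_iter):
        if q is None:                     # first iteration: Q_1 regresses on r_t
            y = [r for _, _, r, _ in four_tuples]
        else:                             # target: r_t + gamma * max_u Q(x_{t+1}, u)
            y = [r + gamma * max(predict(q, (x_next, a)) for a in actions)
                 for _, _, r, x_next in four_tuples]
        q = fit_ensemble(X, y, n_trees, seed)
    return q

# Toy problem (an assumption of this sketch): a chain of states 0..10,
# actions -1/+1, reward 1 whenever the successor state is 10.
states = range(11)
actions = (-1.0, 1.0)
data = []
for x in states:
    for u in actions:
        x_next = min(max(x + u, 0), 10)
        data.append((float(x), u, 1.0 if x_next == 10 else 0.0, float(x_next)))

q = fitted_q_iteration(data, actions, gamma=0.9, n_iter=30)
# Greedy policy derived from the approximate Q-function
policy = {x: max(actions, key=lambda a: predict(q, (float(x), a))) for x in states}
```

On this toy chain the derived greedy policy selects the action +1 in every state, which is the expected behaviour since reward is only obtained at the right end. Replacing the random cut-point selection with a score-based choice among random candidate splits would move the sketch toward extremely randomized trees, at the cost of tree structures that change with the outputs.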
