Active Policy Learning for Robot Planning and Exploration under Uncertainty

This paper proposes a simulation-based active policy learning algorithm for finite-horizon, partially-observed sequential decision processes. The algorithm is tested in the domain of robot navigation and exploration under uncertainty. In such a setting, the expected cost, which must be minimized, is a function of the belief state (filtering distribution). This filtering distribution is in turn nonlinear and depends on an observation model with discontinuities. These discontinuities arise because the robot has a finite field of view and the environment may contain occluding obstacles. As a result, the expected cost is nondifferentiable and very expensive to simulate. The new algorithm overcomes the nondifferentiability and reduces the number of required simulations as follows. First, it assumes that previous simulations have been carried out, returning values of the expected cost for the corresponding policy parameters. Second, it fits a Gaussian process (GP) regression model to these values, so as to approximate the expected cost as a function of the policy parameters. Third, it uses the GP predictive mean and variance to construct a statistical measure that determines which policy parameters should be used in the next simulation. The process is then repeated with the new parameters and the newly gathered expected cost observation. Since the objective is to find the policy parameters that minimize the expected cost, this iterative active learning approach effectively trades off exploration (in regions where the GP variance is large) against exploitation (where the GP mean is low). In our experiments, a robot uses the proposed algorithm to plan an optimal path for accomplishing a series of tasks while maximizing the information about its pose and map estimates. These estimates are obtained with a standard filter for simultaneous localization and mapping. Upon gathering new observations, the robot updates the state estimates and replans its path in the spirit of open-loop feedback control.
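To make the three-step loop concrete, below is a minimal sketch in Python, not the paper's implementation. It assumes a hypothetical `simulate_expected_cost` callback standing in for the expensive robot simulation, a squared-exponential kernel with fixed hyperparameters, and expected improvement as the statistical infill measure (one standard choice in this literature); the actual kernel, hyperparameter handling, and acquisition criterion may differ.

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared-exponential covariance between two sets of policy parameters."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, y, Xstar, noise_var=1e-4):
    """GP predictive mean and variance at candidate policy parameters Xstar."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    Ks = rbf_kernel(X, Xstar)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = rbf_kernel(Xstar, Xstar).diagonal() - np.sum(v**2, 0)
    return mu, np.maximum(var, 1e-12)

def expected_improvement(mu, var, best_cost):
    """Infill criterion: expected reduction below the best simulated cost."""
    sigma = np.sqrt(var)
    z = (best_cost - mu) / sigma
    return (best_cost - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def active_policy_learning(simulate_expected_cost, bounds, n_init=5, n_iters=30):
    """Hypothetical driver: simulate_expected_cost maps a policy-parameter
    vector to a noisy Monte Carlo estimate of the expected cost."""
    rng = np.random.default_rng(0)
    dim = len(bounds)
    lo, hi = np.array(bounds).T
    # Step 1: initial simulations at randomly chosen policy parameters.
    X = rng.uniform(lo, hi, size=(n_init, dim))
    y = np.array([simulate_expected_cost(x) for x in X])
    for _ in range(n_iters):
        # Step 2: fit the GP surrogate to all (parameters, cost) pairs so far.
        cand = rng.uniform(lo, hi, size=(1000, dim))
        mu, var = gp_posterior(X, y, cand)
        # Step 3: choose the candidate maximizing the infill criterion, which
        # trades off exploration (high variance) and exploitation (low mean).
        ei = expected_improvement(mu, var, y.min())
        x_next = cand[np.argmax(ei)]
        X = np.vstack([X, x_next])
        y = np.append(y, simulate_expected_cost(x_next))
    return X[np.argmin(y)], y.min()
```

In a setting like the paper's, the infill criterion would typically be maximized with a global optimizer such as DIRECT rather than by scoring random candidates, and the GP hyperparameters would be estimated from the data; both are simplified away here to keep the sketch short.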
