Active exploration for robot parameter selection in episodic reinforcement learning

As robots and other autonomous systems grow more complex, it becomes increasingly important that they can actively adapt and optimize their own settings. Such optimization is rarely trivial: sampling from the system is often expensive in time and other costs, so excessive sampling must be avoided, and the parameter space is usually continuous and multi-dimensional. Given the inherent exploration-exploitation dilemma of the problem, we propose treating it as an episodic reinforcement learning problem, in which the policy is defined by the system's parameters and the reward is given by the system's performance, accumulated over each episode of a task. In this paper, we present a method for efficiently sampling and optimizing in continuous, multi-dimensional spaces. The approach is based on Gaussian process regression, which can represent continuous non-linear mappings from parameters to system performance, and it employs an upper confidence bound policy that explicitly manages the trade-off between exploration and exploitation. Unlike many other policies for this kind of problem, it does not rely on a discretization of the action space. The method was evaluated on a real robot that had to learn grasping parameters in order to adapt its grasp execution to different objects, and it was also tested on a more general gain-tuning problem. The experimental results show that the method quickly finds suitable parameters and is applicable to real online learning applications.
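To make the selection rule concrete, the sketch below illustrates one way to implement Gaussian-process upper-confidence-bound parameter selection over a continuous box of parameters, in the spirit of the approach described in the abstract; it is a minimal illustration, not the authors' implementation. The rule picks the next parameters by maximizing mu(x) + sqrt(beta) * sigma(x), where mu and sigma are the GP posterior mean and standard deviation of the episodic reward. The squared-exponential kernel, its hyperparameters, the noise level, and the multi-start L-BFGS-B acquisition optimizer are all illustrative assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def rbf_kernel(A, B, length_scale=0.2, signal_var=1.0):
        # Squared-exponential kernel between point sets A (n x d) and B (m x d).
        d2 = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return signal_var * np.exp(-0.5 * d2 / length_scale**2)

    def gp_posterior(X, y, Xq, noise_var=1e-2):
        # Exact GP posterior mean and standard deviation at query points Xq,
        # given observed parameters X and episodic rewards y.
        K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
        Ks = rbf_kernel(X, Xq)
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
        mu = Ks.T @ alpha
        v = np.linalg.solve(L, Ks)
        var = np.diag(rbf_kernel(Xq, Xq)) - np.sum(v**2, axis=0)
        return mu, np.sqrt(np.maximum(var, 1e-12))

    def ucb_select(X, y, bounds, beta=2.0, restarts=10, rng=None):
        # Maximize mu(x) + sqrt(beta) * sigma(x) over the continuous box
        # `bounds` (shape (d, 2)) with multi-start local optimization,
        # avoiding any discretization of the parameter space.
        if rng is None:
            rng = np.random.default_rng()
        def neg_ucb(x):
            mu, sd = gp_posterior(X, y, x[None, :])
            return -(mu[0] + np.sqrt(beta) * sd[0])
        best_x, best_f = None, np.inf
        for _ in range(restarts):
            x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
            res = minimize(neg_ucb, x0, bounds=bounds, method="L-BFGS-B")
            if res.fun < best_f:
                best_x, best_f = res.x, res.fun
        return best_x

In an episodic loop, the first few episodes would use randomly sampled parameters to seed the model; thereafter, each episode's parameters would be chosen with ucb_select, the episode executed on the system, and the observed return appended to (X, y) before the next selection.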
