Data-Efficient Learning of Robotic Grasps From Human Preferences

The ability to grasp various types of objects from the environment is an important manipulation skill for autonomous robots. It is a prerequisite for solving many real-world tasks, ranging from object handover to household tasks such as clearing the dishwasher. However, the huge variety of objects in our environment presents a major obstacle for robots attempting to acquire grasping skills that match the abilities of humans. Although the field of robotic grasping has received great attention over the last decades, grasping arbitrary objects remains a challenging task. In contrast to earlier work, which was often restricted to analytic approaches in simulation, reinforcement learning techniques allow the robot to learn from grasp experience through trial and error. While the use of reinforcement learning has led to promising results, defining a reward function remains an open problem. Prior work learned a reward model from human feedback for a single grasp type. To capture the whole variety of objects in the real world, however, it is necessary to learn grasping motions across different grasp types.

The goal of this thesis is to devise a reinforcement learning approach that enables the robot to learn how to grasp objects across different grasp types without prior knowledge of the reward function. Because robot operation is both time-consuming and costly, the algorithm should additionally be data-efficient. To achieve this goal, we leverage human feedback to learn multiple grasp policies using a hierarchical reinforcement learning approach. On the upper level, the robot chooses a grasp type and location based on the predicted reward of the resulting trajectory. Subsequently, a grasping motion is generated by the lower-level policy of the selected grasp type. To evaluate the outcome of a grasp, we introduce an additional reward model learned entirely from preference feedback. Bayesian optimization and active learning techniques are employed to reduce the number of feedback requests and achieve data-efficient learning.

We examined the usefulness of our approach in experiments on a ball-throwing toy task and a simulated grasping task. We showed that learning an outcome reward model from preferences can improve the performance of the system. Moreover, we were able to reduce the amount of required feedback significantly by introducing an active learning criterion. When applied to robotic grasping, our approach learned to grasp a few known objects using different grasp types, but failed when more objects were added to the task. Future work could improve the modeling of the reward function or the generation of grasp locations. Furthermore, additional experiments on a real robotic system are required to evaluate the performance of the proposed approach.
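To make the description above concrete, the sketch below illustrates the two learning components in Python: a reward model trained purely from pairwise preference feedback, and an active-learning criterion that decides which pair of grasp outcomes to show the human next. It is a minimal sketch, not the thesis implementation: a linear Bradley-Terry ensemble trained by query-by-committee stands in for the Gaussian-process preference model, the feature vectors are random toy data, all function names (fit_bradley_terry, ensemble_rewards) are hypothetical, and the final UCB-style score only hints at how the upper level could select the next grasp to execute.

import numpy as np

rng = np.random.default_rng(0)

def fit_bradley_terry(feats_a, feats_b, prefs, lr=0.1, steps=500, seed=0):
    # Fit reward weights w so that sigmoid(w . (a - b)) matches the
    # preference labels (1 if outcome a was preferred over outcome b).
    r = np.random.default_rng(seed)
    w = r.normal(scale=0.1, size=feats_a.shape[1])
    diff = feats_a - feats_b                         # reward-difference features
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-diff @ w))          # P(a preferred over b)
        w += lr * diff.T @ (prefs - p) / len(prefs)  # logistic gradient step
    return w

def ensemble_rewards(ensemble, feats):
    # Predicted reward of each outcome under every committee member.
    return np.stack([feats @ w for w in ensemble])   # shape (members, n)

# Toy data: outcomes of executed grasps, described by feature vectors.
outcomes = rng.normal(size=(40, 5))
true_w = np.array([1.0, -0.5, 0.0, 2.0, 0.3])        # unknown to the learner

# A small labelled set of preference queries (pairs shown to the human).
idx_a, idx_b = rng.integers(0, 40, size=(2, 15))
labels = (outcomes[idx_a] @ true_w > outcomes[idx_b] @ true_w).astype(float)

# Query-by-committee: several models fit on bootstrap resamples of the labels.
ensemble = []
for s in range(5):
    boot = np.random.default_rng(s).integers(0, len(labels), size=len(labels))
    ensemble.append(fit_bradley_terry(outcomes[idx_a[boot]],
                                      outcomes[idx_b[boot]], labels[boot], seed=s))

# Active learning criterion: request feedback on the candidate pair whose
# preference the committee disagrees on most, reducing the number of queries.
cand_a, cand_b = rng.integers(0, 40, size=(2, 100))
r = ensemble_rewards(ensemble, outcomes)
votes = (r[:, cand_a] > r[:, cand_b]).mean(axis=0)   # fraction voting for "a"
query = np.argmax(-np.abs(votes - 0.5))              # closest to a 50/50 split
print(f"next query: compare outcome {cand_a[query]} with {cand_b[query]}")

# Upper level: pick the grasp with the best predicted reward, here via a
# simple UCB score that echoes the Bayesian-optimization flavour above.
mean, std = r.mean(axis=0), r.std(axis=0)
print("grasp to execute next:", np.argmax(mean + 1.0 * std))

Bootstrapping the labelled pairs, rather than varying only the initialisation, is what gives the committee genuine disagreement to exploit: the logistic objective is convex, so identically trained members would otherwise converge to the same weights and the uncertainty signal would collapse.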
