Assessing Transferability From Simulation to Reality for Reinforcement Learning

Learning robot control policies from physics simulations is of great interest to the robotics community, as it may render the learning process faster, cheaper, and safer by alleviating the need for expensive real-world experiments. However, the direct transfer of learned behavior from simulation to reality is a major challenge. Optimizing a policy on a slightly faulty simulator can easily lead to the maximization of the 'simulation optimization bias' (SOB): the optimizer exploits modeling errors of the simulator, and the resulting behavior can potentially damage the robot. We tackle this challenge by applying domain randomization, i.e., randomizing the parameters of the physics simulation during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA), which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the overfitting to the set of domains experienced during training. Our experimental results on two different second-order nonlinear systems show that the proposed simulation-based policy search algorithm is able to learn a control policy exclusively from a randomized simulator, and that this policy can be applied directly to real systems without any additional training.

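To make the approach concrete, the following minimal Python sketch combines domain randomization with an SOB-style stopping criterion. It is an illustration under stated assumptions, not the paper's implementation: the physics-parameter ranges, the train/test return gap used here as the SOB proxy, and all names (sample_domain, estimated_sob, train_until_transferable) are hypothetical, and SPOTA's actual estimator is more involved, bounding the optimality gap with bootstrapped confidence intervals.

```python
import numpy as np

def sample_domain(rng):
    # Hypothetical randomization ranges for a pendulum-like system; the
    # real ranges depend on the platform and its nominal parameters.
    return {
        "mass": rng.uniform(0.8, 1.2),        # kg
        "length": rng.uniform(0.45, 0.55),    # m
        "friction": rng.uniform(0.02, 0.10),  # N*m*s/rad
    }

def estimated_sob(policy_return, train_domains, test_domains):
    # Crude SOB proxy: gap between the average return on the domains the
    # policy was trained on and on freshly drawn, unseen domains. A gap
    # well above zero indicates overfitting to the training domains.
    j_train = np.mean([policy_return(d) for d in train_domains])
    j_test = np.mean([policy_return(d) for d in test_domains])
    return j_train - j_test

def train_until_transferable(train_policy, policy_return, beta=0.05,
                             n_domains=10, max_iters=10, seed=0):
    # Loosely mirrors the stopping criterion: keep retraining on ever more
    # randomized domains until the estimated SOB drops below beta.
    rng = np.random.default_rng(seed)
    for _ in range(max_iters):
        train_domains = [sample_domain(rng) for _ in range(n_domains)]
        test_domains = [sample_domain(rng) for _ in range(n_domains)]
        policy = train_policy(train_domains)
        sob = estimated_sob(lambda d: policy_return(policy, d),
                            train_domains, test_domains)
        if sob < beta:
            return policy, sob
        n_domains *= 2  # more domains leave less room to exploit one model
    return policy, sob

if __name__ == "__main__":
    # Toy stand-ins: the "policy" is a single gain fitted to the mean mass,
    # and its return is higher the closer the gain is to the true mass.
    def train_policy(domains):
        return np.mean([d["mass"] for d in domains])

    def policy_return(policy, domain):
        return -(policy - domain["mass"]) ** 2

    policy, sob = train_until_transferable(train_policy, policy_return)
    print(f"learned gain: {policy:.3f}, estimated SOB: {sob:.4f}")
```

Doubling the domain count per outer iteration reflects the intuition that, with more simulated domains in play, the optimizer has less room to exploit the modeling errors of any single one.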