Learning Robust Control Policies from Simulations with Perturbed Parameters

Deep reinforcement learning has shown great potential for solving robotic motor tasks. However, deep reinforcement learning algorithms suffer from high sample complexity, which makes training directly on the real robot infeasible. Training in simulated environments has been successful, but transferring the learned policies to the real robot remains difficult. Recent research has therefore focused on learning robust policies that can be transferred successfully.

In this thesis, we developed a framework to learn robust policies and evaluate them in environments with varied physics parameters. The simulation was implemented using the Rcs framework, developed by the Honda Research Institute Europe, which can control both simulated and real-world robots. The applied learning algorithms are based on the rllab framework. To make Rcs-based environments usable from rllab, we wrote RcsPySim as a bridge between the two. Throughout the thesis, we used RcsPySim to evaluate and compare the robust learning algorithms Ensemble Policy Optimization (EPOpt) and Robust Adversarial Reinforcement Learning (RARL), both with each other and against a baseline policy trained with the standard policy optimization algorithm Trust Region Policy Optimization (TRPO). The policies were trained and evaluated on the ball-on-plate task, in which a robot has to stabilize a ball at the center of a plate mounted on its end-effector. We varied different physics parameters to analyze the robustness of the learned policies against these changes, and additionally performed a sensitivity analysis of the parameters.

The obtained results show that the physics parameters relevant for the ball-on-plate task are the ball's friction properties and mass distribution, whereas the ball's mass and radius do not have a significant influence. Moreover, we observed that the baseline policy, trained solely on the nominal physics parameters, is already quite robust. Both RARL and EPOpt increase the robustness for some parameter ranges, but reduce it in others; EPOpt does not perform as well as expected. In general, EPOpt prefers the more cautious approach: it copes better with more unstable simulations, but it cannot solve the task in setups with strong friction. The solution trained by RARL is more aggressive, making it well suited for cases with higher friction, but less attractive for more unstable environments with lower friction. All learned policies could be transferred to the real world. The policy learned by EPOpt should be preferred, as it is the most cautious. Since EPOpt does not work well with friction values higher than the nominal parameters, choosing the nominal friction values to be higher than the measured mean would likely increase the robustness over the whole parameter space.
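
To make the EPOpt idea described above concrete, the following minimal sketch illustrates its core loop on a toy one-dimensional stand-in for the ball-on-plate task: physics parameters are perturbed around nominal values, a fixed policy is rolled out in each perturbed simulation, and only the worst epsilon-fraction of rollouts is kept for the policy update. The dynamics, the linear feedback policy, and all parameter values are illustrative assumptions, not the models or settings used in the thesis; the actual policy update (a TRPO step, done with rllab and RcsPySim in the thesis) is omitted.

```python
import numpy as np

# Toy 1-D "ball on a tilting plate" used only to illustrate the EPOpt loop.
# All values below are illustrative assumptions, not the thesis settings.
NOMINAL = {"mass": 0.05, "rolling_friction": 0.02}  # nominal physics parameters
REL_STD = 0.2        # relative spread of the parameter perturbation
EPSILON = 0.1        # fraction of worst rollouts kept (CVaR-style objective)
GAIN = np.array([-2.0, -1.0])   # fixed linear state-feedback policy
G, DT, STEPS = 9.81, 0.01, 500  # gravity, time step, rollout length


def sample_params(rng):
    """Draw one perturbed parameter set from the source distribution."""
    return {k: v * (1.0 + REL_STD * rng.standard_normal())
            for k, v in NOMINAL.items()}


def rollout(params):
    """Simulate the toy ball dynamics under one parameter set, return the return."""
    x, v, ret = 0.1, 0.0, 0.0           # ball starts off-centre and at rest
    for _ in range(STEPS):
        tilt = float(GAIN @ np.array([x, v]))      # policy action (plate tilt)
        acc = G * np.clip(tilt, -0.3, 0.3)         # tilt accelerates the ball
        acc -= params["rolling_friction"] / params["mass"] * v  # friction term
        v += acc * DT
        x += v * DT
        ret -= x ** 2                              # penalise distance to centre
    return ret


rng = np.random.default_rng(0)
returns = np.array([rollout(sample_params(rng)) for _ in range(100)])
cutoff = np.quantile(returns, EPSILON)             # epsilon-percentile return
worst = returns[returns <= cutoff]
# EPOpt would now run the policy update (TRPO) using only these worst rollouts.
print(f"mean return {returns.mean():.3f}, "
      f"worst-{int(EPSILON * 100)}% mean {worst.mean():.3f}")
```

Because the update only ever sees the worst-performing rollouts, the resulting policy is biased toward cautious behaviour, which matches the conservative character of EPOpt observed in the experiments.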

[1] Marcin Andrychowicz et al. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018.

[2] Sergey Levine et al. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. International Journal of Robotics Research, 2016.

[3] Pietro Falco et al. Data-efficient control policy search using residual dynamics learning. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.

[4] Pieter Abbeel et al. Policy transfer via modularity and reward guiding. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.

[5] Jean-Baptiste Mouret et al. Black-box data-efficient policy search for robotics. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.

[6] Abhinav Gupta et al. Robust Adversarial Reinforcement Learning. ICML, 2017.

[7] Balaraman Ravindran et al. EPOpt: Learning Robust Neural Network Policies Using Model Ensembles. ICLR, 2017.

[8] Sergey Levine et al. Collective robot reinforcement learning with distributed asynchronous guided policy search. 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.

[9] Pieter Abbeel et al. Benchmarking Deep Reinforcement Learning for Continuous Control. ICML, 2016.

[10] Yuval Tassa et al. Simulation tools for model-based robotics: Comparison of Bullet, Havok, MuJoCo, ODE and PhysX. 2015 IEEE International Conference on Robotics and Automation (ICRA), 2015.

[11] Sergey Levine et al. Trust Region Policy Optimization. ICML, 2015.

[12] Yasemin Altun et al. Relative Entropy Policy Search. AAAI, 2010.

[13] Petros Koumoutsakos et al. Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 2003.

[14] Inman Harvey et al. Noise and the Reality Gap: The Use of Simulation in Evolutionary Robotics. ECAL, 1995.