Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization

In practice, the parameters of control policies are often tuned manually, which is both time-consuming and frustrating. Reinforcement learning is a promising alternative that aims to automate this process, yet it often requires too many experiments to be practical. In this paper, we propose a solution to this problem by exploiting prior knowledge from simulations, which are readily available for most robotic platforms. Specifically, we extend Entropy Search, a Bayesian optimization algorithm that maximizes the information gain from each experiment, to the case of multiple information sources. The result is a principled way to automatically combine cheap but inaccurate information from simulations with expensive and accurate physical experiments in a cost-effective manner. We apply the resulting method to a cart-pole system, and the experiments confirm that the algorithm finds good control policies with fewer physical experiments than standard Bayesian optimization on the physical system alone.
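To make the idea concrete, the sketch below is a minimal, illustrative implementation of the two ingredients the abstract describes: a Gaussian process that models the physical objective as the simulator plus a GP error term (the Kennedy-O'Hagan multi-fidelity structure), and an acquisition that normalizes the expected benefit of an experiment by its cost. It is not the paper's algorithm: the entropy-based criterion is replaced here by a crude variance-reduction proxy, and all names, costs, and kernel hyperparameters (`SIM`, `REAL`, `COST`, `rbf`, `next_query`, length scales) are assumptions made for the example.

```python
# Minimal sketch of multi-fidelity Bayesian optimization over two
# information sources: a cheap simulator and an expensive physical
# experiment. NOT the paper's exact method; the entropy criterion is
# replaced by a cost-normalized variance-reduction proxy, and the
# costs/hyperparameters below are illustrative assumptions.

import numpy as np

SIM, REAL = 0, 1
COST = {SIM: 1.0, REAL: 10.0}   # assumed relative experiment costs

def rbf(a, b, ls=0.3, var=1.0):
    # Squared-exponential kernel on 1-D policy parameters.
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / ls) ** 2)

def kernel(X, S, X2, S2):
    # Kennedy-O'Hagan structure: a kernel shared by both sources,
    # plus an error kernel that is active only between pairs of
    # physical experiments (the simulator's model discrepancy).
    k = rbf(X, X2)
    mask = (S[:, None] == REAL) & (S2[None, :] == REAL)
    return k + mask * rbf(X, X2, ls=0.5, var=0.2)

def posterior_var(Xq, Sq, X, S, noise=1e-3):
    # GP posterior variance at query points (Xq, Sq) given the
    # observed inputs (X, S). For a GP with fixed hyperparameters,
    # the variance does not depend on the observed values, which is
    # why no measurements y appear in this sketch.
    K = kernel(X, S, X, S) + noise * np.eye(len(X))
    Kq = kernel(Xq, Sq, X, S)
    Kqq = kernel(Xq, Sq, Xq, Sq)
    return np.diag(Kqq - Kq @ np.linalg.solve(K, Kq.T))

def next_query(X, S, grid):
    # Score each candidate (x, source) by how much it would shrink
    # the posterior variance of the *physical* objective over the
    # grid, per unit cost -- a stand-in for information gain per cost.
    real = np.full(len(grid), REAL)
    base = posterior_var(grid, real, X, S).sum()
    best, best_score = None, -np.inf
    for s in (SIM, REAL):
        for x in grid:
            X2, S2 = np.append(X, x), np.append(S, s)
            gain = base - posterior_var(grid, real, X2, S2).sum()
            score = gain / COST[s]
            if score > best_score:
                best, best_score = (x, s), score
    return best

# Usage: after two simulator runs, propose where and on which source
# to experiment next.
grid = np.linspace(0.0, 1.0, 25)
X, S = np.array([0.2, 0.8]), np.array([SIM, SIM])
print(next_query(X, S, grid))
```

Because simulator queries are ten times cheaper in this toy setup, the rule tends to keep sampling the simulator until its (cheap) information about the physical objective is exhausted, and only then spends a physical experiment, which is the cost-aware behavior the paper's entropy-based criterion formalizes.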
