Non-parametric Policy Search with Limited Information Loss

Learning complex control policies from non-linear and redundant sensory input is an important challenge for reinforcement learning algorithms. Non-parametric methods that approximate value functions or transition models can address this problem by adapting to the complexity of the data set. Yet, many current non-parametric approaches rely on unstable greedy maximization of approximate value functions, which might lead to poor convergence or oscillations in the policy update. A more robust policy update can be obtained by limiting the information loss between successive state-action distributions. In this paper, we develop a policy search algorithm with policy updates that are both robust and non-parametric. Our method can learn non-parametric control policies for infinite-horizon continuous Markov decision processes with non-linear and redundant sensory representations. We investigate how approximations of the kernel function can be used to reduce the time requirements of the demanding non-parametric computations. In our experiments, we demonstrate the strong performance of the proposed method and show how it can be approximated efficiently. Finally, we show that our algorithm can learn a real-robot under-powered swing-up task directly from image data.
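A minimal sketch of the two ideas highlighted in the abstract, not the authors' implementation: approximating a kernel with random Fourier features to reduce the cost of the non-parametric computations, and a relative-entropy (REPS-style) sample reweighting that bounds the information loss between successive state-action distributions. All function names, hyperparameters, and the specific dual formulation below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar


def random_fourier_features(X, n_features=100, bandwidth=1.0, seed=None):
    """Map inputs X of shape (n_samples, d) to random features whose inner
    products approximate the RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))  # frequencies from the kernel's spectral density
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)           # random phase offsets
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)


def kl_bounded_weights(advantages, epsilon=0.1):
    """Sample weights w_i proportional to exp(A_i / eta), with the temperature eta
    chosen by minimizing the standard REPS dual so that the KL divergence between
    the reweighted and the previous sample distribution stays below epsilon."""
    a_max = advantages.max()

    def dual(log_eta):
        eta = np.exp(log_eta)
        shifted = (advantages - a_max) / eta  # shift by the maximum for numerical stability
        return eta * epsilon + eta * np.log(np.mean(np.exp(shifted))) + a_max

    res = minimize_scalar(dual, bounds=(-5.0, 5.0), method="bounded")
    eta = np.exp(res.x)
    w = np.exp((advantages - a_max) / eta)
    return w / w.sum()
```

In a policy-search loop one would, under these assumptions, featurize the observed states with random_fourier_features, compute advantages from rollouts, and fit the next policy (e.g. by weighted regression from features to actions) using the weights returned by kl_bounded_weights, so that each update stays close to the previous state-action distribution.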
