Manifold-based multi-objective policy search with sample reuse

Abstract. Many real-world applications are characterized by multiple conflicting objectives. In such problems, optimality is replaced by Pareto optimality, and the goal is to find the Pareto frontier: a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, accurately representing the Pareto frontier remains a significant challenge. Building on recent advances in reinforcement learning and multi-objective policy search, we present two novel manifold-based algorithms for solving multi-objective Markov decision processes. These algorithms combine episodic exploration strategies and importance sampling to efficiently learn a manifold in the policy parameter space whose image in the objective space accurately approximates the Pareto frontier. We show that episode-based approaches and importance sampling can yield significantly better results in multi-objective reinforcement learning. Evaluated on three multi-objective problems, our algorithms outperform state-of-the-art methods in both the quality of the learned Pareto frontier and sample efficiency.
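To make the idea concrete, the following is a minimal sketch of this class of methods, not the paper's actual algorithm. It assumes a toy two-objective problem in place of a real MDP, a line segment between two learned endpoints as the simplest possible manifold, the 2-D hypervolume as the frontier-quality indicator, and a naive hill-climbing update on the manifold parameters; the weight-threshold reuse rule and all function names are illustrative assumptions.

```python
# Illustrative sketch of manifold-based multi-objective policy search with
# importance-sampling sample reuse. Everything here (the toy objectives, the
# linear "manifold", the hill-climbing update) is a hypothetical stand-in
# for the paper's method, chosen only to make the loop structure concrete.
import numpy as np

rng = np.random.default_rng(0)

def episode_returns(theta):
    # Toy two-objective task standing in for a full MDP episode: the two
    # returns pull theta toward +1 and -1 respectively, so no single theta
    # maximizes both and a frontier of compromises exists.
    return np.array([-np.sum((theta - 1.0) ** 2),
                     -np.sum((theta + 1.0) ** 2)])

def manifold(rho, t):
    # Map a scalar t in [0, 1] onto policy parameters: here a line segment
    # between two learned endpoints; its image in the objective space should
    # come to trace the Pareto frontier.
    dim = rho.size // 2
    return rho[:dim] + t * (rho[dim:] - rho[:dim])

def hypervolume_2d(points, ref):
    # Standard staircase sweep for the 2-D hypervolume (maximization) with
    # respect to a reference point dominated by all solutions.
    pts = points[np.argsort(-points[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if x > ref[0] and y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def gauss_logpdf(x, mean, sigma):
    # Unnormalized isotropic Gaussian log-density; the normalizing constant
    # cancels in the importance-weight ratio below.
    return -0.5 * np.sum(((x - mean) / sigma) ** 2)

dim, sigma, ref = 2, 0.3, np.array([-20.0, -20.0])
rho = rng.normal(size=2 * dim)  # manifold (segment endpoint) parameters
buffer = []                     # (t, theta, sampling mean, return vector)

def score(rho_cand):
    # Estimate the hypervolume of the candidate manifold's image by reusing
    # every stored episode, weighted by how likely its theta would be under
    # the candidate manifold; near-zero-weight episodes are discarded.
    pts = []
    for t, theta, mu, ret in buffer:
        w = np.exp(gauss_logpdf(theta, manifold(rho_cand, t), sigma)
                   - gauss_logpdf(theta, mu, sigma))
        if w > 1e-3:
            pts.append(ret)
    return hypervolume_2d(np.array(pts), ref) if pts else 0.0

for _ in range(300):
    # Episodic exploration: pick a point on the current manifold, perturb it
    # in parameter space, and run a single episode with the resulting policy.
    t = rng.uniform()
    mu = manifold(rho, t)
    theta = mu + sigma * rng.normal(size=dim)
    buffer.append((t, theta, mu, episode_returns(theta)))
    # Hill-climb on rho: keep a perturbation if the importance-weighted
    # hypervolume estimate improves (a stand-in for a proper gradient step).
    cand = rho + 0.1 * rng.normal(size=rho.shape)
    if score(cand) > score(rho):
        rho = cand

print("learned segment endpoints:", rho[:dim], rho[dim:])
```

Thresholding the importance weights is a crude surrogate for the weighted frontier estimates a complete method would compute; in practice one would also control the variance of the weights, for example with truncated or self-normalized importance sampling, before trusting the reused episodes.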
