Manifold-based multi-objective policy search with sample reuse

Abstract. Many real-world applications are characterized by multiple conflicting objectives. In such problems, optimality is replaced by Pareto optimality, and the goal is to find the Pareto frontier: a set of solutions representing different compromises among the objectives. Despite recent advances in multi-objective optimization, accurately representing the Pareto frontier remains a significant challenge. Building on recent advances in reinforcement learning and multi-objective policy search, we present two novel manifold-based algorithms for solving multi-objective Markov decision processes. These algorithms combine episodic exploration strategies and importance sampling to efficiently learn a manifold in the policy parameter space whose image in the objective space accurately approximates the Pareto frontier. We show that episode-based approaches and importance sampling can yield significantly better results in multi-objective reinforcement learning. Evaluated on three multi-objective problems, our algorithms outperform state-of-the-art methods in both the quality of the learned Pareto frontier and sample efficiency.
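To make the idea concrete, the following is a minimal sketch of this class of methods, not the paper's actual algorithm. It assumes a toy two-objective problem in place of a real MDP, a line segment between two learned endpoints as the simplest possible manifold, the 2-D hypervolume as the frontier-quality indicator, and a naive hill-climbing update on the manifold parameters; the weight-threshold reuse rule and all function names are illustrative assumptions.

```python
# Illustrative sketch of manifold-based multi-objective policy search with
# importance-sampling sample reuse. Everything here (the toy objectives, the
# linear "manifold", the hill-climbing update) is a hypothetical stand-in
# for the paper's method, chosen only to make the loop structure concrete.
import numpy as np

rng = np.random.default_rng(0)

def episode_returns(theta):
    # Toy two-objective task standing in for a full MDP episode: the two
    # returns pull theta toward +1 and -1 respectively, so no single theta
    # maximizes both and a frontier of compromises exists.
    return np.array([-np.sum((theta - 1.0) ** 2),
                     -np.sum((theta + 1.0) ** 2)])

def manifold(rho, t):
    # Map a scalar t in [0, 1] onto policy parameters: here a line segment
    # between two learned endpoints; its image in the objective space should
    # come to trace the Pareto frontier.
    dim = rho.size // 2
    return rho[:dim] + t * (rho[dim:] - rho[:dim])

def hypervolume_2d(points, ref):
    # Standard staircase sweep for the 2-D hypervolume (maximization) with
    # respect to a reference point dominated by all solutions.
    pts = points[np.argsort(-points[:, 0])]
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if x > ref[0] and y > prev_y:
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

def gauss_logpdf(x, mean, sigma):
    # Unnormalized isotropic Gaussian log-density; the normalizing constant
    # cancels in the importance-weight ratio below.
    return -0.5 * np.sum(((x - mean) / sigma) ** 2)

dim, sigma, ref = 2, 0.3, np.array([-20.0, -20.0])
rho = rng.normal(size=2 * dim)  # manifold (segment endpoint) parameters
buffer = []                     # (t, theta, sampling mean, return vector)

def score(rho_cand):
    # Estimate the hypervolume of the candidate manifold's image by reusing
    # every stored episode, weighted by how likely its theta would be under
    # the candidate manifold; near-zero-weight episodes are discarded.
    pts = []
    for t, theta, mu, ret in buffer:
        w = np.exp(gauss_logpdf(theta, manifold(rho_cand, t), sigma)
                   - gauss_logpdf(theta, mu, sigma))
        if w > 1e-3:
            pts.append(ret)
    return hypervolume_2d(np.array(pts), ref) if pts else 0.0

for _ in range(300):
    # Episodic exploration: pick a point on the current manifold, perturb it
    # in parameter space, and run a single episode with the resulting policy.
    t = rng.uniform()
    mu = manifold(rho, t)
    theta = mu + sigma * rng.normal(size=dim)
    buffer.append((t, theta, mu, episode_returns(theta)))
    # Hill-climb on rho: keep a perturbation if the importance-weighted
    # hypervolume estimate improves (a stand-in for a proper gradient step).
    cand = rho + 0.1 * rng.normal(size=rho.shape)
    if score(cand) > score(rho):
        rho = cand

print("learned segment endpoints:", rho[:dim], rho[dim:])
```

Thresholding the importance weights is a crude surrogate for the weighted frontier estimates a complete method would compute; in practice one would also control the variance of the weights, for example with truncated or self-normalized importance sampling, before trusting the reused episodes.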
