A Geometric Approach to Multi-Criterion Reinforcement Learning

We consider the problem of reinforcement learning in a controlled Markov environment with multiple objective functions of the long-term average reward type. The environment is initially unknown and, furthermore, may be affected by the actions of other agents, actions that are observed but cannot be predicted beforehand. We capture this situation using a stochastic game model, where the learning agent faces an adversary whose policy is arbitrary and unknown, and where the reward function is vector-valued. State recurrence conditions are imposed throughout. In our basic problem formulation, a desired target set is specified in the vector reward space, and the objective of the learning agent is to approach the target set, in the sense that the long-term average reward vector will belong to this set. We devise appropriate learning algorithms that essentially run multiple reinforcement learning algorithms for the standard scalar reward problem and combine them using the geometric insight from the theory of approachability for vector-valued stochastic games. We then address the more general, optimization-related problem in which a nested class of possible target sets is prescribed and the goal of the learning agent is to approach the smallest possible target set (which will generally depend on the unknown system parameters). A particular case that falls into this framework is that of stochastic games with average reward constraints, and further specialization provides a reinforcement learning algorithm for constrained Markov decision processes. Some basic examples are provided to illustrate these results.
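
To illustrate the approachability-based steering idea described above, the following is a minimal Python sketch, not the paper's algorithm: it assumes a hypothetical vector-reward environment interface env.step(policy), a hypothetical scalar-reward RL subroutine scalar_rl, and a convex target set approximated by a finite array of points. In each phase, the vector reward is scalarized along the direction from the current average reward toward the target set, and the resulting scalar problem is handed to the standard RL subroutine.

    import numpy as np

    def closest_target_point(avg_reward, target_points):
        # Nearest point of the target set to the current average reward vector.
        # For simplicity the (convex) target set is represented by a finite
        # array of points of shape (num_points, dim).
        dists = np.linalg.norm(target_points - avg_reward, axis=1)
        return target_points[np.argmin(dists)]

    def steering_direction(avg_reward, target_points):
        # Unit vector pointing from the current average reward toward the
        # target set; zero if the average is already (numerically) in the set.
        diff = closest_target_point(avg_reward, target_points) - avg_reward
        norm = np.linalg.norm(diff)
        return diff / norm if norm > 1e-12 else np.zeros_like(diff)

    def approach_target(env, scalar_rl, target_points,
                        num_phases, steps_per_phase, dim):
        # Approachability-style steering loop: in each phase, scalarize the
        # vector reward along the current steering direction and solve the
        # resulting scalar-reward problem with a standard RL subroutine.
        avg_reward = np.zeros(dim)
        t = 0
        for _ in range(num_phases):
            lam = steering_direction(avg_reward, target_points)
            # Hypothetical subroutine: learns a policy for the scalar reward <lam, r>.
            policy = scalar_rl(env, lam, steps_per_phase)
            for _ in range(steps_per_phase):
                # Hypothetical interface: one step under the learned policy,
                # returning the vector-valued one-step reward.
                r_vec = env.step(policy)
                t += 1
                avg_reward += (r_vec - avg_reward) / t  # running average reward vector
        return avg_reward

Roughly speaking, Blackwell-type approachability arguments guarantee that, for a convex target set and under suitable recurrence conditions, such steering drives the long-term average reward vector toward the target set, provided each scalarized problem is solved nearly optimally.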
