Multi-objective Bandits: Optimizing the Generalized Gini Index

We study the multi-armed bandit (MAB) problem where the agent receives vectorial feedback that encodes several possibly competing objectives to be optimized. The goal of the agent is to find a policy that optimizes these objectives simultaneously in a fair way. This multi-objective online optimization problem is formalized using the Generalized Gini Index (GGI) aggregation function. We propose an online gradient descent algorithm that exploits the convexity of the GGI aggregation function and carefully controls the exploration, achieving a distribution-free regret of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability. We test our algorithm on synthetic data as well as on an electric battery control problem, where the goal is to trade off the use of the different cells of a battery in order to balance their respective degradation rates.
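For intuition, below is a minimal sketch of the GGI aggregation in its ordered-weighted-average form: the components of a cost vector are sorted in decreasing order and combined with non-increasing positive weights, so the worst objectives receive the largest weights. The function name `ggi` and the weight choice $w_i = 2^{-(i-1)}$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ggi(costs, weights):
    """Generalized Gini Index of a cost vector (OWA form).

    Assumes `weights` is a positive, non-increasing sequence, so the
    largest (worst) cost components are weighted most heavily, which
    is what makes the aggregation fairness-inducing.
    """
    sorted_costs = np.sort(costs)[::-1]        # worst components first
    return float(np.dot(weights, sorted_costs))

# Illustrative weights w_i = 1 / 2^(i-1); any strictly decreasing
# positive sequence yields a valid GGI.
d = 3
w = 0.5 ** np.arange(d)
print(ggi(np.array([0.2, 0.9, 0.4]), w))  # the 0.9 component dominates the score
```

Because GGI is a maximum over permutations of such weighted sums, it is convex in the cost vector, which is the property the online gradient descent approach relies on.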
