PAC Bandits with Risk Constraints

We study the problem of best-arm identification with risk constraints in the fixed-confidence pure-exploration bandit setting (PAC bandits). The goal is to stop as early as possible and, with high confidence, return an arm whose mean is ε-close to that of the best arm among those satisfying a risk constraint, namely, arms whose α-quantiles are larger than a threshold β. For this risk-sensitive bandit problem, we propose an algorithm and prove an upper bound on its sample complexity for the general case of sub-Gaussian arm distributions. We also prove a lower bound for this general case, which shows that our upper bound is near-optimal (up to logarithmic factors). Both our upper and lower bounds have a form similar to the risk-neutral PAC bandit results of Even-Dar et al. (2006) and Mannor and Tsitsiklis (2004), respectively. We further prove a lower bound for the case of Gaussian arm distributions, which is smaller than our general lower bound but stronger in the sense that it applies to every instance of the (Gaussian) problem. This lower bound is stated in terms of KL divergences and behaves similarly to the risk-neutral PAC bandit result of Kaufmann et al. (2016).
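To make the objective concrete, below is a minimal sketch of a naive uniform-sampling baseline for this problem: pull every arm the same number of times, discard arms whose empirical α-quantile falls below β, and return the surviving arm with the largest empirical mean. This is an illustrative sketch under stated assumptions, not the algorithm proposed in the paper; the function name `risk_constrained_pac`, the constants in the sample size, and the lack of confidence margins around the quantile test are all assumptions made here for brevity.

```python
import math
import numpy as np

def risk_constrained_pac(arms, alpha, beta, eps, delta):
    """Naive uniform-sampling sketch (hypothetical; not the paper's algorithm).

    arms  : list of zero-argument callables, each returning one i.i.d. reward
    alpha : quantile level of the risk constraint
    beta  : threshold the alpha-quantile must exceed
    eps   : accuracy target for the returned arm's mean
    delta : overall failure probability
    """
    k = len(arms)
    # Naive PAC sample size for 1-sub-Gaussian rewards:
    # O((1/eps^2) log(k/delta)) pulls per arm, i.e. O((k/eps^2) log(k/delta))
    # in total. Constants are illustrative; the paper derives the precise ones.
    n = math.ceil((16.0 / eps**2) * math.log(4.0 * k / delta))
    feasible_means = {}
    for i, pull in enumerate(arms):
        rewards = np.array([pull() for _ in range(n)])
        # Empirical alpha-quantile (a value-at-risk estimate): keep only arms
        # whose estimated quantile clears the risk threshold beta.
        if np.quantile(rewards, alpha) >= beta:
            feasible_means[i] = rewards.mean()
    if not feasible_means:
        return None  # no arm passed the empirical risk test
    # Among the empirically feasible arms, return the one with the best mean.
    return max(feasible_means, key=feasible_means.get)
```

For example, with `rng = np.random.default_rng(0)` and three Gaussian arms `arms = [lambda m=m: rng.normal(m, 1.0) for m in (0.2, 0.5, 0.9)]`, calling `risk_constrained_pac(arms, alpha=0.1, beta=-1.0, eps=0.1, delta=0.05)` returns the index of the highest-mean arm whose empirical 0.1-quantile exceeds -1.0.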

[1] Krishnendu Chatterjee et al. Generalized Risk-Aversion in Stochastic Multi-Armed Bandits, arXiv, 2014.

[2] Alessandro Lazaric et al. Best Arm Identification: A Unified Approach to Fixed Budget and Fixed Confidence, NIPS, 2012.

[3] W. R. Thompson. On the Theory of Apportionment, 1935.

[4] John N. Tsitsiklis et al. The Sample Complexity of Exploration in the Multi-Armed Bandit Problem, J. Mach. Learn. Res., 2004.

[5] Odalric-Ambrym Maillard. Robust Risk-Averse Stochastic Multi-armed Bandits, ALT, 2013.

[6] Shie Mannor et al. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems, J. Mach. Learn. Res., 2006.

[7] Matthew Malloy et al. lil' UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits, COLT, 2013.

[8] Ambuj Tewari et al. PAC Subset Selection in Stochastic Multi-armed Bandits, ICML, 2012.

[9] Shivaram Kalyanakrishnan et al. Information Complexity in Bandit Subset Selection, COLT, 2013.

[10] R. Rockafellar et al. Optimization of Conditional Value-at-Risk, 2000.

[11] W. R. Thompson. On the Likelihood That One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.

[12] Robert Nowak et al. A KL-LUCB Algorithm for Large-Scale Crowdsourcing, NIPS, 2017.

[13] H. Robbins. Some Aspects of the Sequential Design of Experiments, 1952.

[14] Michèle Sebag et al. Exploration vs Exploitation vs Safety: Risk-Aware Multi-Armed Bandits, ACML, 2013.

[15] Jia Yuan Yu et al. Sample Complexity of Risk-Averse Bandit-Arm Selection, IJCAI, 2013.

[16] Qing Zhao et al. Risk-Averse Multi-Armed Bandit Problems Under Mean-Variance Measure, IEEE Journal of Selected Topics in Signal Processing, 2016.

[17] Oren Somekh et al. Almost Optimal Exploration in Multi-Armed Bandits, ICML, 2013.

[18] Philippe Artzner et al. Coherent Measures of Risk, 1999.

[19] Aurélien Garivier et al. On the Complexity of Best-Arm Identification in Multi-Armed Bandit Models, J. Mach. Learn. Res., 2016.

[20] Stefano Ermon et al. Best Arm Identification in Multi-Armed Bandits with Delayed Feedback, AISTATS, 2018.