Learning Adversarial Markov Decision Processes with Bandit Feedback and Unknown Transition
[1] Gergely Neu,et al. Online learning in episodic Markovian decision processes by relative entropy policy search , 2013, NIPS.
[2] Xiangyang Ji,et al. Regret Minimization for Reinforcement Learning by Evaluating the Optimal Bias Function , 2019, NeurIPS.
[3] Yishay Mansour,et al. Online Markov Decision Processes , 2009, Math. Oper. Res..
[4] Yishay Mansour,et al. Online Convex Optimization in Adversarial Markov Decision Processes , 2019, ICML.
[5] Yishay Mansour,et al. Online Stochastic Shortest Path with Bandit Feedback and Unknown Transition Function , 2019, NeurIPS.
[6] Peter Auer,et al. Finite-time Analysis of the Multiarmed Bandit Problem , 2002, Machine Learning.
[7] Shie Mannor,et al. Markov Decision Processes with Arbitrary Reward Processes , 2009, Math. Oper. Res..
[8] John Langford,et al. Contextual Bandit Algorithms with Supervised Learning Guarantees , 2010, AISTATS.
[9] E. Altman. Constrained Markov Decision Processes , 1999 .
[10] David Simchi-Levi,et al. Reinforcement Learning under Drift , 2019, ArXiv.
[11] Michael I. Jordan,et al. Is Q-learning Provably Efficient? , 2018, NeurIPS.
[12] Gergely Neu,et al. Explore no more: Improved high-probability regret bounds for non-stochastic bandits , 2015, NIPS.
[13] Ambuj Tewari,et al. Deterministic MDPs with Adversarial Rewards and Bandit Feedback , 2012, UAI.
[14] Yi Ouyang,et al. Learning Unknown Markov Decision Processes: A Thompson Sampling Approach , 2017, NIPS.
[15] Haipeng Luo,et al. Model-free Reinforcement Learning in Infinite-horizon Average-reward Markov Decision Processes , 2020, ICML.
[16] Apostolos Burnetas,et al. Optimal Adaptive Policies for Markov Decision Processes , 1997, Math. Oper. Res..
[17] Massimiliano Pontil,et al. Empirical Bernstein Bounds and Sample-Variance Penalization , 2009, COLT.
[18] Shie Mannor,et al. Arbitrarily modulated Markov decision processes , 2009, Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference.
[19] Peter Auer,et al. Near-optimal Regret Bounds for Reinforcement Learning , 2008, J. Mach. Learn. Res..
[20] Csaba Szepesvári,et al. Improved Algorithms for Linear Stochastic Bandits , 2011, NIPS.
[21] Elad Hazan,et al. Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization , 2008, COLT.
[22] Aleksandrs Slivkins,et al. Corruption Robust Exploration in Episodic Reinforcement Learning , 2019, ArXiv.
[23] Wei Chu,et al. Contextual Bandits with Linear Payoff Functions , 2011, AISTATS.
[24] Elad Hazan,et al. Introduction to Online Convex Optimization , 2016, Found. Trends Optim..
[25] Rémi Munos,et al. Minimax Regret Bounds for Reinforcement Learning , 2017, ICML.
[26] Csaba Szepesvári,et al. Online Markov Decision Processes Under Bandit Feedback , 2010, IEEE Transactions on Automatic Control.
[27] Peter Auer,et al. Hannan Consistency in On-Line Learning in Case of Unbounded Losses Under Partial Monitoring , 2006, ALT.
[28] Yuval Peres,et al. Bandits with switching costs: T^{2/3} regret , 2013, STOC.
[29] Alessandro Lazaric,et al. Efficient Bias-Span-Constrained Exploration-Exploitation in Reinforcement Learning , 2018, ICML.
[30] Elad Hazan,et al. Better Rates for Any Adversarial Deterministic MDP , 2013, ICML.
[31] Xiaoyu Chen,et al. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP , 2019, ICLR.
[32] Peter Auer,et al. The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..
[33] András György,et al. The adversarial stochastic shortest path problem with unknown transition probabilities , 2012, AISTATS.