[1] Demis Hassabis, et al. Mastering the game of Go without human knowledge, 2017, Nature.
[2] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[3] Max Simchowitz, et al. Non-Asymptotic Gap-Dependent Regret Bounds for Tabular MDPs, 2019, NeurIPS.
[4] Aurélien Garivier, et al. Parametric Bandits: The Generalized Linear Case, 2010, NIPS.
[5] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[6] Per Ola Börjesson, et al. Simple Approximations of the Error Function Q(x) for Communications Applications, 1979, IEEE Trans. Commun.
[7] Thomas P. Hayes, et al. Stochastic Linear Optimization under Bandit Feedback, 2008, COLT.
[8] Haipeng Luo, et al. Learning Adversarial MDPs with Bandit Feedback and Unknown Transition, 2019, arXiv.
[9] Gergely Neu, et al. Online learning in episodic Markovian decision processes by relative entropy policy search, 2013, NIPS.
[10] Shipra Agrawal, et al. Thompson Sampling for Contextual Bandits with Linear Payoffs, 2012, ICML.
[11] Daniel Russo, et al. Worst-Case Regret Bounds for Exploration via Randomized Value Functions, 2019, NeurIPS.
[12] Christoph Dann, et al. Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning, 2015, NIPS.
[13] S. Kakade, et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes, 2019, COLT.
[14] E. Ordentlich, et al. Inequalities for the L1 Deviation of the Empirical Distribution, 2003.
[15] Shie Mannor, et al. Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies, 2019, NeurIPS.
[16] Csaba Szepesvári, et al. Improved Algorithms for Linear Stochastic Bandits, 2011, NIPS.
[17] Tor Lattimore, et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning, 2017, NIPS.
[18] P. Massart, et al. Adaptive estimation of a quadratic functional by model selection, 2000.
[19] Alper Atamtürk, et al. Maximizing a Class of Utility Functions Over the Vertices of a Polytope, 2017, Oper. Res.
[20] Benjamin Van Roy, et al. (More) Efficient Reinforcement Learning via Posterior Sampling, 2013, NIPS.
[21] Rémi Munos, et al. Minimax Regret Bounds for Reinforcement Learning, 2017, ICML.
[22] Benjamin Van Roy, et al. On Lower Bounds for Regret in Reinforcement Learning, 2016, arXiv.
[23] Peter Auer, et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.
[24] Alessandro Lazaric, et al. Linear Thompson Sampling Revisited, 2016, AISTATS.
[25] Sergey Levine, et al. End-to-End Training of Deep Visuomotor Policies, 2015, J. Mach. Learn. Res.
[26] Wei Chu, et al. Contextual Bandits with Linear Payoff Functions, 2011, AISTATS.
[27] John Langford, et al. Contextual Bandit Algorithms with Supervised Learning Guarantees, 2010, AISTATS.
[28] Shie Mannor, et al. Thompson Sampling for Learning Parameterized Markov Decision Processes, 2014, COLT.
[29] Michael I. Jordan, et al. Is Q-learning Provably Efficient?, 2018, NeurIPS.
[30] Craig Boutilier, et al. Randomized Exploration in Generalized Linear Bandits, 2019, AISTATS.
[31] Ronald J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.
[32] M. Bartlett. An Inverse Matrix Adjustment Arising in Discriminant Analysis, 1951.
[33] Emma Brunskill, et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds, 2019, ICML.
[34] Lihong Li, et al. Policy Certificates: Towards Accountable Reinforcement Learning, 2018, ICML.
[35] Benjamin Van Roy, et al. Why is Posterior Sampling Better than Optimism for Reinforcement Learning?, 2016, ICML.
[36] E. Altman. Constrained Markov Decision Processes, 1999.