[1] Philip S. Thomas, et al. High Confidence Policy Improvement, 2015, ICML.
[2] Clayton T. Morrison, et al. Blending Autonomous Exploration and Apprenticeship Learning, 2011, NIPS.
[3] Romain Laroche, et al. Safe Policy Improvement with Baseline Bootstrapping, 2017, ICML.
[4] E. Ordentlich, et al. Inequalities for the L1 Deviation of the Empirical Distribution, 2003.
[5] Vineet Goyal, et al. Robust Markov Decision Process: Beyond Rectangularity, 2018, ArXiv:1811.00215.
[6] Richard S. Sutton, et al. Reinforcement Learning: An Introduction, 1998, IEEE Trans. Neural Networks.
[7] Marcus Hutter, et al. Pessimism About Unknown Unknowns Inspires Conservatism, 2020, COLT.
[8] Shane Legg, et al. Human-level control through deep reinforcement learning, 2015, Nature.
[9] Daniel Kuhn, et al. Robust Markov Decision Processes, 2013, Math. Oper. Res..
[10] Nando de Freitas, et al. Critic Regularized Regression, 2020, NeurIPS.
[11] Alec Radford, et al. Proximal Policy Optimization Algorithms, 2017, ArXiv.
[12] Marc G. Bellemare, et al. Increasing the Action Gap: New Operators for Reinforcement Learning, 2015, AAAI.
[13] Yuriy Brun, et al. Preventing undesirable behavior of intelligent machines, 2019, Science.
[14] David Silver, et al. Deep Reinforcement Learning with Double Q-Learning, 2015, AAAI.
[15] Yoshua Bengio, et al. Revisiting Fundamentals of Experience Replay, 2020, ICML.
[16] Rémi Munos, et al. Performance Bounds in Lp-norm for Approximate Value Iteration, 2007, SIAM J. Control. Optim..
[17] Jiawei Huang, et al. Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization, 2020, ArXiv.
[18] Martin L. Puterman, et al. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.
[19] Mohamed Medhat Gaber, et al. Imitation Learning, 2017, ACM Comput. Surv..
[20] Thorsten Joachims, et al. MOReL: Model-Based Offline Reinforcement Learning, 2020, NeurIPS.
[21] Yifan Wu, et al. Behavior Regularized Offline Reinforcement Learning, 2019, ArXiv.
[22] Doina Precup, et al. Off-Policy Deep Reinforcement Learning without Exploration, 2018, ICML.
[23] A. Antos, et al. Value-Iteration Based Fitted Policy Iteration: Learning with a Single Trajectory, 2007, IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.
[24] Robert L. Winkler, et al. The Optimizer's Curse: Skepticism and Postdecision Surprise in Decision Analysis, 2006, Manag. Sci..
[25] S. Levine, et al. Conservative Q-Learning for Offline Reinforcement Learning, 2020, NeurIPS.
[26] Martin A. Riedmiller. Neural Fitted Q Iteration - First Experiences with a Data Efficient Neural Reinforcement Learning Method, 2005, ECML.
[27] Prabhat Nagarajan, et al. Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations, 2019, ICML.
[28] Emma Brunskill, et al. Provably Good Batch Reinforcement Learning Without Great Exploration, 2020, ArXiv.
[29] Natasha Jaques, et al. Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog, 2019, ArXiv.
[30] Dale Schuurmans, et al. Striving for Simplicity in Off-policy Deep Reinforcement Learning, 2019, ArXiv.
[31] Robert Givan, et al. Bounded-parameter Markov decision processes, 2000, Artif. Intell..
[32] Garud Iyengar, et al. Robust Dynamic Programming, 2005, Math. Oper. Res..
[33] Sergey Levine, et al. Trust Region Policy Optimization, 2015, ICML.
[34] John Langford, et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.
[35] Laurent El Ghaoui, et al. Robust Control of Markov Decision Processes with Uncertain Transition Matrices, 2005, Oper. Res..
[36] Marek Petrik, et al. Safe Policy Improvement by Minimizing Robust Baseline Regret, 2016, NIPS.
[37] Martin A. Riedmiller, et al. Batch Reinforcement Learning, 2012, Reinforcement Learning.
[38] Lantao Yu, et al. MOPO: Model-based Offline Policy Optimization, 2020, NeurIPS.
[39] S. Levine, et al. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems, 2020, ArXiv.
[40] Thorsten Joachims, et al. Batch learning from logged bandit feedback through counterfactual risk minimization, 2015, J. Mach. Learn. Res..
[41] Romain Laroche, et al. Safe Policy Improvement with Soft Baseline Bootstrapping, 2019, ECML/PKDD.
[42] Tian Tian, et al. MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments, 2019.
[43] Romain Laroche, et al. Safe Policy Improvement with an Estimated Baseline Policy, 2020, AAMAS.
[44] Jeffrey Pennington, et al. Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks, 2020, ICLR.
[45] Pierre Geurts, et al. Tree-Based Batch Mode Reinforcement Learning, 2005, J. Mach. Learn. Res..
[46] Thomas G. Dietterich, et al. PAC optimal MDP planning with application to invasive species management, 2015, J. Mach. Learn. Res..
[47] Nicolas Le Roux, et al. The Value Function Polytope in Reinforcement Learning, 2019, ICML.
[48] Sergey Levine, et al. Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction, 2019, NeurIPS.
[49] Csaba Szepesvari, et al. Bandit Algorithms, 2020.
[50] Massimiliano Pontil, et al. Empirical Bernstein Bounds and Sample-Variance Penalization, 2009, COLT.
[51] Michael L. Littman, et al. An analysis of model-based Interval Estimation for Markov Decision Processes, 2008, J. Comput. Syst. Sci..