Online Apprenticeship Learning

In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function. Instead, we observe trajectories sampled by an expert that acts according to some policy. The goal is to find a policy that matches the expert’s performance on some predefined set of cost functions. We introduce an online variant of AL (Online Apprenticeship Learning; OAL), in which the agent is expected to perform comparably to the expert while interacting with the environment. We show that the OAL problem can be effectively solved by combining two mirror-descent-based no-regret algorithms: one for policy optimization and another for learning the worst-case cost. By employing optimistic exploration, we derive a convergent algorithm with O(√K) regret, where K is the number of interactions with the MDP, plus an additional linear error term that depends on the number of expert trajectories available. Importantly, our algorithm avoids the need to solve an MDP at each iteration, making it more practical than prior AL methods. Finally, we implement a deep variant of our algorithm that shares some similarities with GAIL (Ho and Ermon 2016), but in which the discriminator is replaced by the costs learned by solving the OAL problem. Our simulations suggest that OAL performs well in high-dimensional control problems.
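To make the two-player structure concrete, below is a minimal tabular sketch of the alternating mirror-descent scheme, assuming linear costs c_w(s, a) = ⟨w, φ(s, a)⟩ over a known feature map. The helpers expert_occupancy (the expert’s empirical state-action occupancy) and rollout_occupancy (an estimate of the agent’s occupancy from environment interaction) are illustrative assumptions rather than the paper’s interface, and the policy step uses the one-step cost where the full algorithm would plug in optimistic Q-value estimates.

```python
import numpy as np

def online_apprenticeship(phi, expert_occupancy, rollout_occupancy,
                          num_states, num_actions, K,
                          eta_pi=0.1, eta_c=0.1):
    """Alternating mirror-descent sketch of the OAL min-max game.

    The cost player ascends the performance gap between the agent and
    the expert (Euclidean mirror map, projected onto the unit ball);
    the policy player runs a multiplicative-weights update (KL mirror
    map) against the current cost.
    """
    d = phi.shape[-1]
    pi = np.full((num_states, num_actions), 1.0 / num_actions)  # uniform start
    w = np.zeros(d)                                             # cost weights

    for _ in range(K):
        cost = phi @ w                # current cost c_w(s, a) = <w, phi(s, a)>
        mu = rollout_occupancy(pi)    # agent's state-action occupancy estimate
        # Cost player: gradient ascent on <mu - mu_E, c_w>, then project
        # back onto the unit ball (the feasible set of cost functions).
        grad_w = (mu - expert_occupancy).reshape(-1) @ phi.reshape(-1, d)
        w = w + eta_c * grad_w
        w = w / max(1.0, np.linalg.norm(w))
        # Policy player: KL-regularized mirror descent, i.e. exponentiated
        # gradient. A faithful implementation would use optimistic Q-value
        # estimates here instead of the one-step cost.
        pi = pi * np.exp(-eta_pi * cost)
        pi = pi / pi.sum(axis=1, keepdims=True)
    return pi, w
```

Projecting the cost weights onto the unit ball keeps the adversary’s cost class bounded, which is what lets the two players’ no-regret guarantees compose into the overall O(√K) bound.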

[1] Francesco Orabona. A Modern Introduction to Online Learning, 2019, arXiv.

[2] Aaron C. Courville et al. Improved Training of Wasserstein GANs, 2017, NIPS.

[3] Jun-Kun Wang et al. On Frank-Wolfe and Equilibrium Computation, 2017, NIPS.

[4] Yoshua Bengio et al. Generative Adversarial Nets, 2014, NIPS.

[5] Chi Jin et al. Provably Efficient Exploration in Policy Optimization, 2019, ICML.

[6] Tor Lattimore et al. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning, 2017, NIPS.

[7] Yufeng Zhang et al. Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate, 2020, arXiv.

[8] Yoav Freund et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[9] Brett Browning et al. A survey of robot learning from demonstration, 2009, Robotics Auton. Syst.

[10] Lin F. Yang et al. Toward the Fundamental Limits of Imitation Learning, 2020, NeurIPS.

[11] Alexandros Kalousis et al. Sample-Efficient Imitation Learning via Generative Adversarial Nets, 2018, AISTATS.

[12] Emma Brunskill et al. Tighter Problem-Dependent Regret Bounds in Reinforcement Learning without Domain Knowledge using Value Function Bounds, 2019, ICML.

[13] J. Andrew Bagnell et al. Efficient Reductions for Imitation Learning, 2010, AISTATS.

[14] Robert E. Schapire et al. A Game-Theoretic Approach to Apprenticeship Learning, 2007, NIPS.

[15] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 1994.

[16] Tuo Zhao et al. On Computation and Generalization of Generative Adversarial Imitation Learning, 2020, ICLR.

[17] Manfred K. Warmuth et al. The Weighted Majority Algorithm, 1994, Inf. Comput.

[18] Stefano Ermon et al. Generative Adversarial Imitation Learning, 2016, NIPS.

[19] Matthieu Geist et al. A Theory of Regularized Markov Decision Processes, 2019, ICML.

[20] Marc Teboulle et al. Mirror descent and nonlinear projected subgradient methods for convex optimization, 2003, Oper. Res. Lett.

[21] Stefan Schaal et al. Learning from Demonstration, 1996, NIPS.

[22] Yuval Tassa et al. MuJoCo: A physics engine for model-based control, 2012, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[23] Pieter Abbeel et al. Exploration and apprenticeship learning in reinforcement learning, 2005, ICML.

[24] Shie Mannor et al. Adaptive Trust Region Policy Optimization: Global Convergence and Faster Rates for Regularized MDPs, 2020, AAAI.

[25] Mohammad Ghavamzadeh et al. Mirror Descent Policy Optimization, 2020, arXiv.

[26] Peter Auer et al. Near-optimal Regret Bounds for Reinforcement Learning, 2008, J. Mach. Learn. Res.

[27] Benjamin Van Roy et al. Deep Exploration via Bootstrapped DQN, 2016, NIPS.

[28] Michael I. Jordan et al. Provably Efficient Reinforcement Learning with Linear Function Approximation, 2019, COLT.

[29] Matthieu Geist et al. Local Policy Search in a Convex Space and Conservative Policy Iteration as Boosted Policy Search, 2014, ECML/PKDD.

[30] Sergey Levine et al. Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor, 2018, ICML.

[31] Sergey Levine et al. Trust Region Policy Optimization, 2015, ICML.

[32] John Langford et al. Approximately Optimal Approximate Reinforcement Learning, 2002, ICML.

[33] Haim Kaplan et al. Apprenticeship Learning via Frank-Wolfe, 2019, AAAI.

[34] Elad Hazan et al. Introduction to Online Convex Optimization, 2016, Found. Trends Optim.

[35] Stefano Ermon et al. Model-Free Imitation Learning with Policy Optimization, 2016, ICML.

[36] A. S. Nemirovsky and D. B. Yudin. Problem Complexity and Method Efficiency in Optimization, 1983.

[37] Pieter Abbeel et al. Apprenticeship learning via inverse reinforcement learning, 2004, ICML.

[38] Y. Freund et al. Adaptive game playing using multiplicative weights, 1999.

[39] S. Kakade et al. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes, 2019, COLT.

[40] E. Ordentlich et al. Inequalities for the L1 Deviation of the Empirical Distribution, 2003.

[41] Shie Mannor et al. Tight Regret Bounds for Model-Based Reinforcement Learning with Greedy Policies, 2019, NeurIPS.

[42] Shie Mannor et al. Optimistic Policy Optimization with Bandit Feedback, 2020, ICML.

[43] Michael H. Bowling et al. Apprenticeship learning using linear programming, 2008, ICML.

[44] Huang Xiao et al. Wasserstein Adversarial Imitation Learning, 2019, arXiv.

[45] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, 1998, MIT Press.

[46] Zhiheng Li et al. Wasserstein Distance guided Adversarial Imitation Learning with Reward Shape Exploration, 2020, IEEE 9th Data Driven Control and Learning Systems Conference (DDCLS).