Learning from Demonstrations and Human Evaluative Feedbacks: Handling Sparsity and Imperfection Using Inverse Reinforcement Learning Approach

Programming by demonstration is one of the most efficient methods of knowledge transfer for developing advanced learning systems, provided that teachers deliver abundant and correct demonstrations and learners perceive them correctly. In almost all real-world problems, however, demonstrations are sparse and inaccurate, and complementary information is needed to compensate for these shortcomings. In this paper, we target programming by a combination of nonoptimal and sparse demonstrations and a limited number of binary evaluative feedbacks, where the learner uses its own evaluated experiences as new demonstrations in an extended inverse reinforcement learning (IRL) method. This gives the learner broader generalization and lower regret, as well as robustness in the face of sparsity and nonoptimality in demonstrations and feedbacks. Our method relieves teachers of the unrealistic burden of providing optimal and abundant demonstrations. Evaluative feedback, which is easy for teachers to deliver, makes it possible to correct the learner’s behavior in an interactive social setting without requiring teachers to know and use an accurate reward function of their own. Here, we extend IRL to estimate the reward function from a mixture of nonoptimal and sparse demonstrations and evaluative feedbacks. Our method, learning from demonstration and human’s critique, has two phases. The teacher first provides some demonstrations with which the learner initializes its policy. Next, the learner interacts with the environment and the teacher provides binary evaluative feedbacks. Taking into account possible inconsistencies and mistakes in issuing and receiving feedbacks, the learner revises the estimated reward function by solving a single optimization problem. The method is designed to handle errors and sparsity in demonstrations and feedbacks and can generalize over different combinations of these two sources of expertise. We apply our method to three domains: a simulated navigation task, a simulated car-driving problem with human interaction, and a navigation experiment with a mobile robot. The results indicate that the proposed method significantly enhances the learning process where standard IRL methods fail and learning-from-feedback methods incur high regret. Moreover, it works well at different levels of sparsity and optimality in the teacher’s demonstrations and feedbacks, where other state-of-the-art methods fail.
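
To make the two-phase scheme described above concrete, the following is a minimal, hypothetical Python sketch. It assumes a linear reward over trajectory features, a hinge-style fit to the teacher's demonstrations in phase 1, and a logistic feedback likelihood with an explicit error rate for noisy binary feedback in phase 2; the function names and the specific objective are illustrative assumptions, not the paper's exact optimization problem.

```python
"""Minimal, hypothetical sketch of the two-phase scheme described in the
abstract. The linear reward model, hinge-style fit, logistic feedback
likelihood, and error-rate term are illustrative assumptions, not the
paper's exact formulation."""
import numpy as np

def traj_features(traj):
    # traj: list of per-step feature vectors phi(s, a); a trajectory is
    # summarized by its summed feature counts.
    return np.sum(traj, axis=0)

def init_reward_from_demos(demo_trajs, alt_trajs, n_features, lr=0.1, iters=200):
    """Phase 1: estimate reward weights w so that demonstrated trajectories
    score higher than alternative (e.g. random) trajectories, then use w
    to initialize the learner's policy."""
    w = np.zeros(n_features)
    for _ in range(iters):
        for d in demo_trajs:
            for a in alt_trajs:
                margin = w @ traj_features(d) - w @ traj_features(a)
                if margin < 1.0:  # hinge-style update when the demo does not win by a margin
                    w += lr * (traj_features(d) - traj_features(a))
    return w

def revise_reward_with_feedback(w, own_trajs, labels, error_rate=0.1, lr=0.05, iters=200):
    """Phase 2: the learner's own trajectories, labelled by binary teacher
    feedback (+1 good / -1 bad), are folded in as additional, possibly
    noisy demonstrations; error_rate discounts mistaken or inconsistent
    feedback while the reward weights are re-fit."""
    for _ in range(iters):
        grad = -0.01 * w  # L2 regularization
        for traj, y in zip(own_trajs, labels):
            f = traj_features(traj)
            p = 1.0 / (1.0 + np.exp(-w @ f))          # P(feedback = +1 | w)
            target = (1 - error_rate) if y > 0 else error_rate
            grad += (target - p) * f                   # logistic-likelihood gradient
        w += lr * grad
    return w

# Usage sketch: feature vectors would come from the task's feature map.
# demos = [[phi_1, phi_2, ...], ...]; alts = [...]
# w = init_reward_from_demos(demos, alts, n_features=len(demos[0][0]))
# w = revise_reward_with_feedback(w, learner_trajs, teacher_labels)
```

The two functions mirror the two phases in the abstract: the first uses only the (possibly sparse, nonoptimal) demonstrations, and the second revises the reward estimate from the learner's own evaluated experiences in a single regularized objective.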
