Between Instruction and Reward: Human-Prompted Switching

Intelligent systems promise to amplify, augment, and extend innate human abilities. A principal example is that of assistive rehabilitation robots—artificial intelligence and machine learning enable new electromechanical systems that restore biological functions lost through injury or illness. In order for an intelligent machine to assist a human user, it must be possible for a human to communicate their intentions and preferences to their non-human counterpart. While there are a number of techniques that a human can use to direct a machine learning system, most research to date has focused on the contrasting strategies of instruction and reward. The primary contribution of our work is to demonstrate that the middle ground between instruction and reward is a fertile space for research and immediate technological progress. To support this idea, we introduce the setting of human-prompted switching, and illustrate the successful combination of switching with interactive learning using a concrete real-world example: human control of a multi-joint robot arm. We believe techniques that fall between the domains of instruction and reward are complementary to existing approaches, and will open up new lines of rapid progress for interactive human training of machine learning systems.

Smarter, Stronger, More Productive

Humans make use of automated resources to augment and extend our physical and cognitive abilities. Machine-based augmentation is especially prominent in the setting of rehabilitation medicine—assistive devices like artificial limbs and cochlear implants have taken on a central role in restoring biological functions lost through injury, illness, or congenital complications. In particular, robotic prostheses have made significant improvements to the quality of life and functional abilities achievable by amputees (Williams 2011). However, as prosthetic devices increase in power and complexity, there is a resulting increase in the complexity of the control interface that binds a prosthesis to a human user. Despite the potential for improved abilities, many amputees find the control of multi-function robotic limbs frustrating and confusing; non-intuitive control is a principal cause of prosthesis rejection by amputees (Peerdeman et al. 2011).

Starting with work in the 1960s, a number of increasingly successful control paradigms have been developed to help amputees direct their powered robotic prostheses. While classical control remains the mainstay for current commercial prostheses, machine learning has provided some of the most successful methods for controlling next-generation robot limbs. Examples of machine learning for multifunction prosthesis control include support vector machines, artificial neural networks, linear discriminant analysis, and reinforcement learning (Scheme and Englehart 2011; Micera, Carpaneto, and Raspopovic 2010; Pilarski et al. 2011, 2012). The use of artificial intelligence and machine learning is a natural trajectory for automation: in an applications context, we strive to make machines more intelligent so that we can improve our control abilities, achieving greater power and precision when addressing our goals.

Directing an Intelligent System

One consequence of human-machine collaboration is that humans must find ways to successfully communicate their intentions and goals to learning machines. Humans must take on the challenge of directing intelligent automated systems. Interaction is one way of addressing this challenge.
Through ongoing interactions, a human can direct and mould the operation of a learning system to more closely match his or her intentions. Information from a human trainer has been shown to allow a system to achieve arbitrary user-centric goals, improve a system's learning speed, increase asymptotic performance, overcome local optima, and beneficially direct a system's exploration (Judah et al. 2010; Kaplan et al. 2002; Knox and Stone 2012; Lin 1991–1993; Pilarski et al. 2011; Thomaz and Breazeal 2008). It is natural to expect that providing added instructional information to a learning system will help drive the learning process (Lin 1992; Thomaz and Breazeal 2008). Interactive teaching is a dominant approach to human and animal learning, and techniques from these biological domains seem to transfer well to the machine learning case. Building on a basis in biological learning, many approaches operate within the framework of reinforcement learning (Sutton and Barto 1998) and deliver direction by way of generalized scalar feedback signals known as reward; others provide explicit instruction in the form of demonstrations, performance critiques, or semantically dense training interactions.

Figure 1: The continuum of interactive training and direction methods. There are a number of ways that human-generated signals have been used to direct a learning machine. We can characterize these interactions as lying within a two-dimensional space. One dimension corresponds to how explicit the human signals are and the other corresponds to the overall bandwidth or information density of the signals. Most application domains lie along the diagonal between full autonomy and full control, shown in red.

The use of reward and instruction during interactive learning has produced a number of important milestones. Previous work has shown how trial-and-error machine learning can be enabled or accelerated through the presentation of human-delivered rewards and intermediate reinforcements. Examples include the use of shaping signals (Kaplan et al. 2002), the combination of human and environmental reward (Knox and Stone 2012), multi-signal reinforcement (Thomaz and Breazeal 2008), and our preliminary work on human-directed prosthesis controllers (Pilarski et al. 2011); a minimal code sketch of this combined-reward style of update is given at the end of this section. The presentation of interactive learning demonstrations or instructions has also been shown to help teach a collection of viable sub-policies even when a globally optimal policy is challenging to achieve (e.g., Chao, Cakmak, and Thomaz 2010; Judah et al. 2010; Kaplan et al. 2002; Lin 1991–1993). As such, leading approaches to the human training of a machine learner almost exclusively involve the presentation of new information in the form of instruction or reward. These human directions and examples supplement the signals already occurring in a machine learner's sensorimotor stream.

Work on instruction and reward is representative of a growing body of literature on interactive learning, and there are a number of ways that non-interactive human guidance has been used to direct learning machines. We suggest that the continuum of human training and direction methods can be usefully separated along three main axes:

Explicitness: Explicitness describes the degree to which the signals from a human user contain explicit semantics, and relates to the detail of voluntary human involvement.
At one end of this axis is reward, as in the reinforcement learning case of a scalar feedback signal (e.g., Knox and Stone 2012) or a binary shaping signal (e.g., Kaplan et al. 2002). At the other extreme is instruction in the form of demonstration learning and supervisory control (e.g., Lin 1991–1993), performance critiques following a period of action by the learner (e.g., Judah et al. 2010), and the Socially Guided Machine Learning of Chao, Cakmak, and Thomaz (2010).

Bandwidth: Bandwidth refers to the rate and density with which information is passed to a learning system, in terms of signal frequency, signal complexity (binary, scalar, vector, nominal), and the number of distinct signalling channels. Directive information may be as simple as a single binary reinforcement or shaping signal (Kaplan et al. 2002), can involve multiple signals or signal types being presented to the learning system (Thomaz and Breazeal 2008), or can involve the processing of verbal and non-verbal cues (Chao, Cakmak, and Thomaz 2010). Signals may be presented regularly during real-time operation (Knox and Stone 2012; Pilarski et al. 2011), or may be sparse and irregularly spaced with gaps between signals (e.g., Chao, Cakmak, and Thomaz 2010).

Immediacy: Interaction may vary in terms of its timeliness, from real-time interactions between a human and a learner, to episodic, asynchronous, or offline interactions. Highly interactive direction involves instantaneous or immediate feedback signals about what the agent has done or is about to do (Kaplan et al. 2002; Thomaz and Breazeal 2008; Knox and Stone 2012). Less interactive direction occurs when human signals are presented before or after a learner's operation—e.g., the a priori presentation of temporally extended training sequences (Lin 1991–1993) or a posteriori performance evaluation (Judah et al. 2010). Fixed control schemes, such as classical PID control and pre-defined reward functions, occupy the far end of the immediacy axis.

We are interested in human-robot control settings where a machine learner improves through ongoing, real-time interaction with the human user over an extended period of time. Artificial limbs and assistive rehabilitation devices fall into this category. As such, for the remainder of this work we will deal with the case of interactive direction and therefore focus on ideas of bandwidth and explicitness.

The two-dimensional space formed by combining bandwidth and explicitness is shown in Figure 1. The bottom left of this continuum represents fully autonomous operation (no human direction), while the top right represents full human control (explicit high-bandwidth supervision; no automation). The notion of sliding control between a human and an autonomous system can also be represented on the continuum shown in Figure 1. As one example, a reduction in the number or frequency of signals needed from a human user takes the form of a shift in communication bandwidth (Figure 1, red arrows pointing left). A critical region of the bandwidth/explicitness continuum is the spectrum we define as the degree of
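To make the reward-based end of this continuum more concrete, the following is a minimal sketch of how a learner might blend an environmental reward with an occasional human-delivered scalar signal, in the spirit of the combined-reward approaches cited above. This is an illustrative toy example, not the mechanism of any specific cited system: the weighted-sum blending rule, the class name, and parameters such as human_weight are assumptions introduced here for exposition.

```python
import random
from collections import defaultdict


class CombinedRewardQLearner:
    """Tabular Q-learning agent whose update blends environmental reward
    with an optional human-delivered scalar signal (e.g., +1/-1 key presses).

    Illustrative sketch only: the weighted-sum blending rule and all names
    are assumptions, not the mechanism of any cited system.
    """

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1,
                 human_weight=1.0):
        self.q = defaultdict(float)       # (state, action) -> value estimate
        self.actions = list(actions)
        self.alpha = alpha                # step size
        self.gamma = gamma                # discount factor
        self.epsilon = epsilon            # exploration rate
        self.human_weight = human_weight  # scaling on the human channel

    def select_action(self, state):
        # Epsilon-greedy selection over current value estimates.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, env_reward, next_state, human_reward=0.0):
        # Blend the two feedback channels into one scalar reward, then
        # perform a standard one-step temporal-difference update.
        reward = env_reward + self.human_weight * human_reward
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error


# Hypothetical interaction loop on a toy five-state task: most steps carry
# only environmental reward; the human occasionally injects a shaping signal.
learner = CombinedRewardQLearner(actions=[0, 1])
state = 0
for step in range(1000):
    action = learner.select_action(state)
    next_state = (state + action) % 5
    env_reward = 1.0 if next_state == 0 else 0.0
    human_reward = 1.0 if (step % 50 == 0 and action == 1) else 0.0
    learner.update(state, action, env_reward, next_state, human_reward)
    state = next_state
```

Here the human signal is simply summed with the environmental reward; the cited approaches differ in how, and whether, they separately model, weight, or schedule the human channel relative to the reward arriving from the environment.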

References

[1] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 1992.

[2] B. Peerdeman et al. Myoelectric forearm prostheses: State of the art from a user-centered perspective. Journal of Rehabilitation Research and Development, 2011.

[3] T. W. Williams. Progress on stabilizing and controlling powered upper-limb prostheses. Journal of Rehabilitation Research and Development, 2011.

[4] F. Kaplan, P.-Y. Oudeyer, et al. Robotic clicker training. Robotics and Autonomous Systems, 2002.

[5] J. Modayil, A. White, and R. S. Sutton. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 2011.

[6] K. Judah, S. Roy, A. Fern, and T. G. Dietterich. Reinforcement learning via practice and critique advice. AAAI, 2010.

[7] P. M. Pilarski et al. Dynamic switching and real-time machine learning for improved human control of assistive biomedical robots. 4th IEEE RAS & EMBS International Conference on Biomedical Robotics and Biomechatronics (BioRob), 2012.

[8] A. L. Thomaz and C. Breazeal. Teachable robots: Understanding human teaching behavior to build more effective robot learners. Artificial Intelligence, 2008.

[9] L.-J. Lin. Hierarchical learning of robot skills by reinforcement. IEEE International Conference on Neural Networks, 1993.

[10] P. M. Pilarski et al. Online human training of a myoelectric prosthesis controller via actor-critic reinforcement learning. IEEE International Conference on Rehabilitation Robotics (ICORR), 2011.

[11] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[12] R. S. Sutton et al. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. AAMAS, 2011.

[13] L.-J. Lin. Programming robots using reinforcement learning and teaching. AAAI, 1991.

[14] W. B. Knox and P. Stone. Reinforcement learning from simultaneous human and MDP reward. AAMAS, 2012.

[15] S. Micera, J. Carpaneto, and S. Raspopovic. Control of hand prostheses using peripheral information. IEEE Reviews in Biomedical Engineering, 2010.

[16] C. Chao, M. Cakmak, and A. L. Thomaz. Transparent active learning for robots. 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2010.

[17] M. R. Dawson et al. The development of a myoelectric training tool for above-elbow amputees. The Open Biomedical Engineering Journal, 2012.

[18] E. Scheme and K. Englehart. Electromyogram pattern recognition for control of powered upper-limb prostheses: State of the art and challenges for clinical use. Journal of Rehabilitation Research and Development, 2011.