Multi-Phase Learning for Jazz Improvisation and Interaction

This article presents a model for computational learning, composed of two phases, that enables a machine to interactively improvise jazz with a human. To explore and demonstrate this model, a working system has been built, called CHIME, for Computer Human Interacting Musical Entity. In phase 1, a recurrent neural network is used to train the machine to reproduce three jazz melodies. Using this knowledge, CHIME can interactively play music with a human in real time by trading fours in jazz improvisation. The machine is further trained in phase 2 with a real-valued reinforcement learning algorithm. The paper details the mechanisms for learning and interaction and presents the results. The paper presentation includes real-time demonstrations of CHIME.

1 This work was funded in part by the U.S. National Science Foundation, under grant CDA-9720508, through the Human Computer Interaction Program. It was also supported (non-financially) by the John Payne Music Center in Brookline, Massachusetts, U.S.A. All opinions in the article are the author's own.

Introduction

The goal in developing CHIME was to use machine learning to enable a machine to learn to improvise jazz and to interact with a human player. Creating this kind of human/machine interaction has great aesthetic and philosophical appeal. CHIME starts as a beginner, but after training it can "trade fours" with a human. That is, each player improvises (makes up original jazz melodies in real time) for four measures while the other listens; then the other takes a turn, and so on. These four measures are played over a chord structure. In trading fours, a player tries to incorporate some of the other player's music while adding new material.

The type of machine learning used is based in connectionist artificial networks and includes both supervised learning (recurrent back-propagation) and reinforcement learning. Connectionist, or artificial neural network (ANN), learning has much to offer human-computer interaction, as has been explored by (Griffith and Todd 1999; Todd and Loy 1991; Hörnel and Menzel 1998): complex mappings between inputs and outputs can be learned. Recurrent artificial neural networks are neural networks with a feedback loop from the output units to units at one of the preceding layers. In this way the network is given information about its past actions. The networks can be trained prior to use and are also capable of real-time learning (learning while doing). Reinforcement learning offers the ability to learn using indirect feedback about the effects of the network's outputs.

CHIME demonstrates a model for computational learning that consists of phases of learning; currently there are two. In phase 1, the learning method is supervised learning (recurrent back-propagation), in which the machine is trained to play three jazz tunes. In supervised learning the correct output is always known, and an error is formed as the difference between the actual output of the network and the correct output. This error is used to train the network. Phase 1 has its basis in work by (Todd 1991). After phase 1 training, the machine is able to trade fours with the human, on-line and in real time.

In phase 2, the trained recurrent network is expanded and further trained via reinforcement learning. Reinforcement learning is a natural approach for learning to improvise because its trial-and-error basis allows it to discover new solutions to problems; it does not need an exact error to learn. After this additional training the computer can again trade fours in real time.
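The difference between the two training signals can be made concrete with a minimal sketch. The code below is not CHIME's algorithm: the learning rate, layer sizes, and the reward-modulated rule (a simple baseline-relative update standing in for the paper's real-valued reinforcement algorithm) are all illustrative assumptions.

```python
import numpy as np

# Illustrative contrast of the two training signals; a generic sketch,
# not CHIME's code.  All sizes and constants are assumptions.
rng = np.random.default_rng(0)
alpha = 0.05                                  # assumed learning rate
W = rng.normal(scale=0.1, size=(26, 40))      # 40 hidden units -> 26 outputs

def supervised_update(W, hidden, output, target):
    """Phase 1: the correct output is known, so an exact per-unit error
    (target - output) drives the weight change, as in back-propagation."""
    error = target - output
    return W + alpha * np.outer(error, hidden)

def reinforcement_update(W, hidden, output, reward, baseline):
    """Phase 2: only a scalar 'good/bad' judgment is available.  Outputs
    that earn more reward than expected are reinforced; no per-unit
    error is needed, which permits trial-and-error discovery."""
    advantage = reward - baseline             # a critic would supply the baseline
    return W + alpha * advantage * np.outer(output, hidden)

# One update of each kind on dummy data:
hidden = rng.random(40)
output = W @ hidden
target = np.zeros(26); target[5] = 1.0        # the known correct note (phase 1 only)
W = supervised_update(W, hidden, output, target)
W = reinforcement_update(W, hidden, output, reward=1.0, baseline=0.2)
```

The contrast is the point: the supervised rule needs a full target vector, while the reinforcement rule needs only a single number judging the output after the fact.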
Figure 1 shows the CHIME architecture for training, learning, and real-time interaction. The top box shows phase 1, where the initial recurrent network is trained on existing melodies. The inputs to the network are shown as boxes on the left. In phase 1, the network input is a decaying history of its own output (the recurrent feedback link), a representation of the chord over which it is improvising, and a number representing which song it is learning. After phase 1 training, when it is interacting with a human, its input is a decaying history of the human's most recent improvisation and, again, the current chord. In phase 2, the network receives both the recurrent feedback link and the human's improvisation. During the phase 2 training period, it can use either a recorded human improvisation or be trained while interacting with a human. In both phases the network is given the current chord over which it is improvising. A reinforcement value gives some indication of how "good" or "bad" the output is; at this stage of CHIME, a set of rudimentary rules for jazz is used to obtain the reinforcement value.

Figure 1. The CHIME architecture for learning and real-time interaction with a human.

On the right of Figure 1 are 1) an adaptive critic that learns to predict reinforcement values in order to account for delays between actions and results, and 2) a mechanism that adaptively scales the reinforcement so that the network can continue to learn after it has learned to perform reasonably well.

Use of Recurrent Artificial Neural Networks

(Todd 1991) used a recurrent network for classical melody learning and generation. The connection from output to input provides a memory of recently played notes. The network was taught, using back-propagation, to reproduce melodies, and it was able to distinguish between various melodies by noting the distinction in inputs called plan inputs. Interestingly, the network was able to generalize: when given a plan that it had not seen before, it generated a new melody.

In phase 1, a recurrent network is first trained off-line using supervised learning. A diagram of a recurrent network, based on Jordan's work (Jordan 1986), is shown in Figure 2. It is similar to that used by Todd, except that it has been augmented with the chord input; there are several other, more subtle changes as well. The network is composed of an input layer, a hidden layer, and an output layer. Each node in the output layer corresponds to a note that can be played, to a rest, or to a new-note indicator. When an output node's value is above a fixed threshold, it is a candidate note; a rest is included as a candidate. The output with the highest value above the threshold is chosen as the next note. The note-indicator output signals whether this is a new note (i.e., the note-indicator output is greater than the threshold) or the same note held for another time increment. The network has a range of two octaves (the 24 chromatic notes of Western tonal music). Each unit in the hidden and output layers computes the weighted sum of its inputs; the output units are linear, so their output is simply this sum.
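The recurrence and the decoding rule can be summarized in a short sketch. The following is an illustrative reconstruction, not CHIME's code: the layer sizes, decay rate, threshold value, sigmoid hidden units, and chord encoding are assumptions; only the decaying output history, the 24-note-plus-rest range, the threshold/argmax selection, and the new-note indicator come from the text.

```python
import numpy as np

# Sketch of the Jordan-style recurrence and note decoding.
# ASSUMPTIONS (not from the paper): layer sizes, decay rate, threshold,
# sigmoid hidden units, and the 12-element chord encoding.
N_NOTES = 24                  # two chromatic octaves (from the text)
N_OUT = N_NOTES + 2           # + one rest unit + one new-note indicator
N_CHORD = 12                  # assumed pitch-class chord representation
N_HIDDEN = 40                 # assumed hidden-layer size

rng = np.random.default_rng(1)
W_hid = rng.normal(scale=0.1, size=(N_HIDDEN, N_OUT + N_CHORD))
W_out = rng.normal(scale=0.1, size=(N_OUT, N_HIDDEN))

def step(context, chord, decay=0.7, threshold=0.5):
    """One time increment: feed the decaying history of past outputs
    (the recurrent link) plus the current chord through the network,
    then decode the output units into a musical event."""
    x = np.concatenate([context, chord])
    hidden = 1.0 / (1.0 + np.exp(-(W_hid @ x)))  # sigmoid hidden units (assumed)
    out = W_out @ hidden                         # linear output units (from text)

    # Decoding: any unit above the threshold is a candidate (rest included);
    # the highest-valued candidate is chosen as the next event.
    events = out[:N_NOTES + 1]                   # 24 notes + 1 rest unit
    candidates = events >= threshold
    if candidates.any():
        choice = int(np.argmax(np.where(candidates, events, -np.inf)))
    else:
        choice = N_NOTES                         # no candidate: rest (assumption)
    # New-note indicator above threshold => new note; otherwise the
    # previous note is held for another time increment.
    is_new_note = bool(out[N_NOTES + 1] >= threshold)

    # The recurrent feedback link: old history decays, new output is added.
    context = decay * context + out
    return context, choice, is_new_note

# Example of driving the network over one chord:
context = np.zeros(N_OUT)
chord = np.zeros(N_CHORD)
chord[[0, 4, 7, 10]] = 1.0                       # e.g. a C7 chord (assumed encoding)
for _ in range(4):
    context, event, is_new = step(context, chord)
```

Untrained weights will of course produce noise; the sketch only fixes the data flow described above: decayed output history plus current chord in, thresholded winner-take-all event out.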