Recurrent Neural Networks and Pitch Representations for Music Tasks

We present results from experiments in using several pitch representations for jazz-oriented musical tasks performed by a recurrent neural network. We have run experiments with several kinds of recurrent networks for this purpose, and have found that Long Short-term Memory networks provide the best results. We show that a new pitch representation called Circles of Thirds works as well as two other published representations for these tasks, yet it is more succinct and enables faster learning. Recurrent Neural Networks and Music Many researchers are familiar with feedforward neural networks consisting of 2 or more layers of processing units, each with weighted connections to the next layer. Each unit passes the sum of its weighted inputs through a nonlinear sigmoid function. Each layer’s outputs are fed forward through the network to the next layer, until the output layer is reached. Weights are initialized to small initial random values. Via the back-propagation algorithm (Rumelhart et al. 1986), outputs are compared to targets, and the errors are propagated back through the connection weights. Weights are updated by gradient descent. Through an iterative training procedure, examples (inputs) and targets are presented repeatedly; the network learns a nonlinear function of the inputs. It can then generalize and produce outputs for new examples. These networks have been explored by the computer music community for classifying chords (Laden and Keefe 1991) and other musical tasks (Todd and Loy 1991, Griffith and Todd 1999). A recurrent network uses feedback from one or more of its units as input in choosing the next output. This means that values generated by units at time step t-1, say y(t-1), are part of the inputs x(t) used in selecting the next set of outputs y(t). A network may be fully recurrent; that is all units are connected back to each other and to themselves. Or part of the network may be fed back in recurrent links. Todd (Todd 1991) uses a Jordan recurrent network (Jordan 1986) to reproduce classical songs and then to produce new songs. The outputs are recurrently fed back as inputs as shown in Figure 1. In addition, self-recurrence on the inputs provides a decaying history of these inputs. The weight update algorithm is back-propagation, using teacher forcing (Williams and Zipser 1988). With teacher forcing, the target outputs are presented to the recurrent inputs from the output units (instead of the actual outputs, which are not correct yet during training). Pitches (on output or input) are represented in a localized binary representation, with one bit for each of the 12 chromatic notes. More bits can be added for more octaves. C is represented as 100000000000. C# is 010000000000, D is 001000000000. Time is divided into 16th note increments. Note durations are determined by how many increments a pitch’s output unit is on (one). E.g. an eighth note lasts for two time increments. Rests occur when all outputs are off (zero). Figure 1. Jordan network, with outputs fed back to inputs. (Mozer 1994)’s CONCERT uses a backpropagationthrough-time (BPTT) recurrent network to learn various musical tasks and to learn melodies with harmonic accompaniment. Then, CONCERT can run in generation mode to compose new music. The BPTT algorithm (Williams and Zipser 1992, Werbos 1988, Campolucci 1998) can be used with a fully recurrent network where the outputs of all units are connected to the inputs of all units, including themselves. The network can include external inputs and optionally, may include a regular feedforward output network (see Figure 2). The BPTT weight updates are proportional to the gradient of the sum of errors over every time step in the interval between start time t0 and end time t1, assuming the error at time step t is affected by the outputs at all previous time steps, starting with t0. BPTT requires saving all inputs, states, and errors for all time steps, and updating the weights in a batch operation at the end, time t1. One sequence (each example) requires one batch weight update. Figure 2. A fully self-recurrent network with external inputs, and optional feedforward output attachment. If there is no output attachment, one or more recurrent units are designated as output units. CONCERT is a combination of BPTT with a layer of output units that are probabilistically interpreted, and a maximum likelihood training criterion (rather than a squared error criterion). There are two sets of outputs (and two sets of inputs), one set for pitch and the other for duration. One pass through the network corresponds to a note, rather than a slice of time. We present only the pitch representation here since that is our focus. Mozer uses a psychologically based representation of musical notes. Figure 3 shows the chromatic circle (CC) and the circle of fifths (CF), used with a linear octave value for CONCERT’s pitch representation. Ignoring octaves, we refer to the rest of the representation as CCCF. Six digits represent the position of a pitch on CC and six more its position on CF. C is represented as 000000 000000, C# as 000001 111110, D as 000011 111111, and so on. Mozer uses -1,1 rather than 0,1 because of implementation details. Figure 3. Chromatic Circle on Left, Circle of Fifths on Right. Pitch position on each circle determines its representation. For chords, CONCERT uses the overlapping subharmonics representation of (Laden and Keefe, 1991). Each chord tone starts in Todd’s binary representation, but 5 harmonics (integer multiples of its frequency) are added. C3 is now C3, C4, G4, C5, E5 requiring a 3 octave representation. Because the 7th of the chord does not overlap with the triad harmonics, Laden and Keefe use triads only. C major triad C3, E3, G3, with harmonics, is C3, C4, G4, C5, E5, E3, E4, B4, E5, G#5, G3, G4, D4, G5, B5. The triad pitches and harmonics give an overlapping representation. Each overlapping pitch adds 1 to its corresponding input. CONCERT excludes octaves, leaving 12 highly overlapping chord inputs, plus an input that is positive when certain key-dependent chords appear, and learns waltzes over a harmonic chord structure. Eck and Schmidhuber (2002) use Long Short-term Memory (LSTM) recurrent networks to learn and compose blues music (Hochreiter and Schmidhuber 1997, and see Gers et al., 2000 for succinct pseudo-code for the algorithm). An LSTM network consists of input units, output units, and a set of memory blocks, each of which includes one or more memory cells. Blocks are connected to each other recurrently. Figure 4 shows an LSTM network on the left, and the contents of one memory block (this one with one cell) on the right. There may also be a direct connection from external inputs to the output units. This is the configuration found in Gers et al., and the one we use in our experiments. Eck and Schmidhuber also add recurrent connections from output units to memory blocks. Each block contains one or more memory cells that are self-recurrent. All other units in the block gate the inputs, outputs, and the memory cell itself. A memory cell can “cache” errors and release them for weight updates much later in time. The gates can learn to delay a block’s outputs, to reset the memory cells, and to inhibit inputs from reaching the cell or to allow inputs in. Figure 4. An LSTM network on the left and a one-cell memory block on the right, with input, forget, and output gates. Black squares on gate connections show that the gates can control whether information is passed to the cell, from the cell, or even within the cell. Weight updates are based on gradient descent, with multiplicative gradient calculations for gates, and approximations from the truncated BPTT (Williams and Peng 1990) and Real-Time Recurrent Learning (RTRL) (Robinson and Fallside 1987) algorithms. LSTM networks are able to perform counting tasks in time-series. Eck and Schmidhuber’s model of blues music is a 12-bar chord sequence over which music is composed/improvised. They successfully trained an LSTM network to learn a sequence of blues chords, with varying durations. Splitting time into 8th note increments, each chord’s duration is either 8 or 4 time steps (whole or half durations). Chords are sets of 3 or 4 tones (triads or triads plus sevenths), represented in a 12-bit localized binary representation with values of 1 for a chord pitch, and 0 for a non-chord pitch. Chords are inverted to fit in 1 octave. For example, C7 is represented as 100010010010 (C,E,G,B-flat), and F7 is 100101000100 (F,A,C,E-flat inverted to C,E-flat,F,A). The network has 4 memory blocks, each containing 2 cells. The outputs are considered probabilities of whether the corresponding note is on or off. The goal is to obtain an output of more that .5 for each note that should be on in a particular chord, with all other outputs below .5. Eck and Schmidhuber’s work includes learning melody and chords with two LSTM networks containing 4 blocks each. Connections are made from the chord network to the melody network, but not vice versa. The authors composed short 1-bar melodies over each of the 12 possible bars. The network is trained on concatenations of the short melodies over the 12-bar blues chord sequence. The melody network is trained until the chords network has learned according to the criterion. In music generation mode, the network can generate new melodies using this training. In a system called CHIME (Franklin 2000, 2001), we first train a Jordan recurrent network (Figure 1) to produce 3 Sonny Rollins jazz/blues melodies. The current chord and index number of the song are non-recurrent inputs to the network. Chords are represented as sets of 4 note values of 1 in a 12-note input layer, with non-chord note inputs set to 0 just as in Eck and Schmidhuber’s chord representation. Chords are also inverted to fit within one octave. 24 (2 octaves) of the outputs are notes, and the 25th is a rest. Of these 25, the unit with the largest value

[1]  Bram Bakker,et al.  Reinforcement Learning with Long Short-Term Memory , 2001, NIPS.

[2]  Louis P. DiPalma,et al.  Music and Connectionism , 1991 .

[3]  Douglas H. Keefe,et al.  The Representation of Pitch in a Neural Net Model of Chord Classification , 1989 .

[4]  Judy A. Franklin Multi-Phase Learning for Jazz Improvisation and Interaction , 2001 .

[5]  PAUL J. WERBOS,et al.  Generalization of backpropagation with application to a recurrent gas market model , 1988, Neural Networks.

[6]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[7]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[8]  Jürgen Schmidhuber,et al.  Learning to Forget: Continual Prediction with LSTM , 2000, Neural Computation.

[9]  Michael C. Mozer,et al.  Neural Network Music Composition by Prediction: Exploring the Benefits of Psychoacoustic Constraints and Multi-scale Processing , 1994, Connect. Sci..

[10]  Michael I. Jordan Attractor dynamics and parallelism in a connectionist sequential machine , 1990 .

[11]  Paolo Campolucci,et al.  A Circuit Theory Approach to Recurrent Neural Network Architectures and Learning Methods , 1998 .

[12]  Jürgen Schmidhuber,et al.  Learning the Long-Term Structure of the Blues , 2002, ICANN.

[13]  Peter M. Todd,et al.  Musical networks , 1999 .

[14]  Peter M. Todd,et al.  A Connectionist Approach To Algorithmic Composition , 1989 .

[15]  P. Todd,et al.  Musical networks: Parallel distributed perception and performance , 1999 .

[16]  Jing Peng,et al.  An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories , 1990, Neural Computation.