An EM Algorithm for Asynchronous Input/Output Hidden Markov Models

In learning tasks in which input sequences are mapped to output sequences, it is often the case that the input and output sequences are not synchronous. For example, in speech recognition, acoustic sequences are longer than phoneme sequences. Input/Output Hidden Markov Models have already been proposed to represent the distribution of an output sequence given an input sequence of the same length. We extend here this model to the case of asynchronous sequences and show an Expectation-Maximization algorithm for training such models.

Introduction

Supervised learning algorithms for sequential data minimize a training criterion that depends on pairs of input and output sequences. It is often assumed that input and output sequences are synchronized, i.e., that each input sequence has the same length as the corresponding output sequence. For instance, recurrent networks (Rumelhart et al.) can be used to map input sequences to output sequences, for example by minimizing at each time step the squared difference between the actual output and the desired output. Another example is a recently proposed recurrent mixture of experts connectionist architecture, which has an interpretation as a probabilistic model called Input/Output Hidden Markov Model (IOHMM) (Bengio and Frasconi; Bengio and Frasconi). This model represents the distribution of an output sequence, given an input sequence of the same length, using a hidden state variable and a Markovian independence assumption, as in Hidden Markov Models (HMMs) (Levinson et al.; Rabiner), in order to simplify the distribution. IOHMMs are a form of probabilistic transducers (Pereira et al.; Singer), with input and output variables which can be discrete as well as continuous-valued.

However, in many sequential problems where one tries to map an input sequence to an output sequence, the lengths of the input and output sequences may not be equal. Input and output sequences could behave at different time scales. For example, in a speech recognition problem where one wants to map an acoustic signal to a phoneme sequence, each phoneme approximately corresponds to a subsequence of the acoustic signal; therefore the input acoustic sequence is generally longer than the output phoneme sequence, and the alignment between inputs and outputs is often not available.

In comparison with HMMs, emission and transition probabilities in IOHMMs vary with time, as a function of the input sequence (a small illustrative sketch of such input-conditioned modules is given at the end of this section). Unlike HMMs, IOHMMs with discrete outputs are discriminant models. Furthermore, the transition probabilities and emission probabilities are generally better matched, which reduces a problem observed in speech recognition HMMs: because outputs are in a much higher dimensional space than transitions in HMMs, the dynamic range of transition probabilities is much smaller than that of emission probabilities, so the choice between different paths during recognition is mostly influenced by emission rather than transition probabilities.

In this paper we present an extension of IOHMMs to the asynchronous case. We first present the probabilistic model, then derive an exact Expectation-Maximization (EM) algorithm for training asynchronous IOHMMs. For complex distributions (e.g., using artificial neural networks to represent transition and emission distributions), a Generalized EM algorithm or gradient ascent in the likelihood can be used. Finally, a recognition algorithm similar to the Viterbi algorithm is presented to map given input sequences to likely output sequences.
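To make the input-dependence of the transition and emission probabilities concrete, the sketch below shows one simple, hypothetical parameterization: log-linear modules that map the current input $u_t$ to the conditional distributions $a(i,j,t)$ and $b(i,s,t)$. The names `softmax`, `transition_probs`, `emission_probs`, `W_trans`, and `W_emit` are illustrative assumptions, not the parameterization used in this paper (which may instead use neural networks).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def transition_probs(W_trans, u_t):
    """P(x_t = i | x_{t-1} = j, u_t) for all (i, j): column j is a distribution over i.

    W_trans: hypothetical weights of shape (N, N, d); u_t: input vector of shape (d,).
    """
    return softmax(W_trans @ u_t, axis=0)

def emission_probs(W_emit, u_t):
    """P(y_s = k | x_t = i, u_t) for all (k, i): column i is a distribution over symbols k.

    W_emit: hypothetical weights of shape (K, N, d); u_t: input vector of shape (d,).
    """
    return softmax(W_emit @ u_t, axis=0)
```

Unlike a standard HMM, whose transition and emission matrices are fixed, these matrices are recomputed at every time step from the current input $u_t$.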
The Model

Let us denote $u_1^T$ for input sequences $u_1, u_2, \ldots, u_T$, and similarly $y_1^S$ for output sequences $y_1, y_2, \ldots, y_S$. In this paper we consider the case in which the output sequences are shorter than the input sequences. The more general case is a straightforward extension of this model using empty transitions that do not take any time, and will be discussed elsewhere.

As in HMMs and IOHMMs, we introduce a discrete hidden state variable $x_t$, which will allow us to simplify the distribution $P(y_1^S \mid u_1^T)$ by using Markovian independence assumptions. The state sequence $x_1^T$ is taken to be synchronous with the input sequence $u_1^T$. In order to produce output sequences shorter than input sequences, we will have states that do not emit an output as well as states that do emit an output. When at time $t$ the system is in a non-emitting state, no output can be produced. Therefore, there exist many sequences of states corresponding to different, shorter-length output sequences.

When conceived as a generative model of the output given the input, an asynchronous IOHMM works as follows. At time $t = 0$, an initial state $x_0$ is chosen according to the distribution $P(x_0)$, and the length of the output sequence is initialized to $s = 0$. At each other time step $t$, a state $x_t$ is first picked according to the transition distribution $P(x_t \mid x_{t-1}, u_t)$, using the state at the previous time step $x_{t-1}$ and the current input $u_t$. If $x_t$ is an emitting state, the length of the output sequence is incremented, and the new last output $y_s$ (where $s$ is the new length) is sampled from the emission distribution $P(y_s \mid x_t, u_t)$. The parameters of the model are thus the initial state probabilities $\pi_i = P(x_0 = i)$ and the parameters of the emission and transition conditional distribution models, $P(y_s \mid x_t, u_t)$ and $P(x_t \mid x_{t-1}, u_t)$. Since the input and output sequences are of different lengths, we introduce another hidden variable $\tau_t$, specifically to represent the alignment between inputs and outputs, with $\tau_t = s$ meaning that $s$ outputs have been emitted at time $t$.

Let us first formalize the independence assumptions and the form of the conditional distribution represented by the model. The conditional probability $P(y_1^S \mid u_1^T)$ can be written as a sum of terms $P(y_1^S, x_1^T, \tau_T \mid u_1^T)$ over all possible state sequences $x_1^T$ such that the number of emitting states in each of these sequences is $S$, the length of the output sequence:

$$P(y_1^S \mid u_1^T) = \sum_{x_1^T \,:\, \tau_T = S} P(y_1^S, x_1^T, \tau_T = S \mid u_1^T).$$

All $S$ outputs must have been emitted by time $T$, so $\tau_T = S$. The hidden state $x_t$ takes discrete values in a finite set. Each of the terms $P(y_1^S, x_1^T, \tau_T = S \mid u_1^T)$ corresponds to a particular sequence of states and a corresponding alignment. This probability can be written as the initial state probability $P(x_0)$ times a product of factors over all the time steps $t$: if state $x_t = i$ is an emitting state, that factor is $P(x_t \mid x_{t-1}, u_t) P(y_s \mid x_t, u_t)$; otherwise, that factor is simply $P(x_t \mid x_{t-1}, u_t)$, where $s$ is the position in the output sequence of the output emitted at time $t$ (when an output is emitted at time $t$).

We summarize in Table 1 the notation we have introduced and define additional notation used in this paper.

Table 1: Notation used in the paper.
- $S$: length of the output sequence
- $T$: length of the input sequence
- $N$: number of states in the IOHMM
- $a(i,j,t)$: output of the module that computes $P(x_t = i \mid x_{t-1} = j, u_t)$
- $b(i,s,t)$: output of the module that computes $P(y_s \mid x_t = i, u_t)$
- $\pi_i = P(x_0 = i)$: initial probability of state $i$
- $z_{i,t} = 1$ if $x_t = i$, $z_{i,t} = 0$ otherwise; these indicator variables give the state sequence
- $m_{s,t} = 1$ if the system emits the $s$-th output at time $t$, $m_{s,t} = 0$ otherwise; these indicator variables give the input/output alignment
- $e_i$ is true if state $i$ emits, and false otherwise
- $\tau_t = s$ means that the first $s$ outputs have been emitted at time $t$
- for discrete inputs, an indicator variable equal to 1 if the $t$-th input symbol is $k$, and 0 otherwise
- for discrete outputs, an indicator variable equal to 1 if the $s$-th output symbol is $k$, and 0 otherwise
- $pred(i)$: the set of all the predecessor states of state $i$
- $succ(i)$: the set of all the successor states of state $i$
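As a concrete illustration of the generative procedure and notation above, here is a minimal sampling sketch. The function and argument names (`sample_output`, `pi`, `emits`, `a`, `b`, `U`) are hypothetical; `a` and `b` stand in for whatever modules compute $a(i,j,t)$ and $b(i,s,t)$ (for instance, the log-linear sketch above), and discrete output symbols are assumed.

```python
import numpy as np

def sample_output(pi, emits, a, b, U, rng=None):
    """Sample an output sequence from an asynchronous IOHMM given an input sequence U.

    pi    : (N,) initial state probabilities pi_i = P(x_0 = i)
    emits : (N,) booleans, emits[i] is True iff state i is an emitting state (e_i)
    a     : callable, a(u_t) -> (N, N) matrix with a(u_t)[i, j] = P(x_t = i | x_{t-1} = j, u_t)
    b     : callable, b(u_t) -> (K, N) matrix with b(u_t)[k, i] = P(y = k | x_t = i, u_t)
    U     : (T, d) input sequence
    """
    if rng is None:
        rng = np.random.default_rng()
    N = len(pi)
    x = rng.choice(N, p=pi)                   # initial state x_0
    y = []                                    # output sequence, grows only at emitting states
    for u_t in U:                             # state sequence is synchronous with the input
        x = rng.choice(N, p=a(u_t)[:, x])     # x_t ~ P(. | x_{t-1}, u_t)
        if emits[x]:                          # emitting state: produce the next output symbol
            B = b(u_t)
            y.append(rng.choice(B.shape[0], p=B[:, x]))
    return y                                  # len(y) <= T; tau_T = len(y)
```

Non-emitting states advance time without producing an output, which is how the sampled output sequence ends up shorter than the input sequence.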
The Markovian conditional independence assumptions in this model mean that the state variable $x_t$ sufficiently summarizes the past of the sequence, so that

$$P(x_t \mid x_1^{t-1}, u_1^t) = P(x_t \mid x_{t-1}, u_t) \quad \text{and} \quad P(y_s \mid x_1^t, u_1^t) = P(y_s \mid x_t, u_t).$$

These assumptions are analogous to the Markovian independence assumptions used in HMMs, and are the same as in synchronous IOHMMs. Based on these two assumptions, the conditional probability can be efficiently represented and computed recursively, using an intermediate variable

$$\alpha(i, s, t) \stackrel{\text{def}}{=} P(x_t = i, \tau_t = s, y_1^s \mid u_1^t).$$

The conditional probability of an output sequence can be expressed in terms of this variable:

$$L \stackrel{\text{def}}{=} P(y_1^S \mid u_1^T) = \sum_i \alpha(i, S, T).$$
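The recursion for $\alpha$ is not spelled out above, but the Markovian assumptions imply a forward pass analogous to the HMM forward algorithm: emitting states advance the alignment from $s-1$ to $s$ and pick up an emission factor, non-emitting states leave it unchanged. The sketch below makes that assumed recursion explicit; the names `forward_likelihood`, `A`, and `B` are hypothetical, with `A[t]` and `B[t]` holding precomputed transition and emission probabilities at time $t$ (`B` already evaluated at the observed outputs).

```python
import numpy as np

def forward_likelihood(pi, emits, A, B, y):
    """Compute L = P(y_1^S | u_1^T) with an alpha recursion (a sketch).

    pi    : (N,) initial state probabilities
    emits : (N,) booleans, True iff the state emits
    A     : (T, N, N) with A[t, i, j] = a(i, j, t) = P(x_t = i | x_{t-1} = j, u_t)
    B     : (T, N, S) with B[t, i, s] = b(i, s, t) = P(y_s | x_t = i, u_t),
            evaluated at the observed output symbols y
    y     : observed output sequence of length S (only its length is used here)
    """
    T, N, _ = A.shape
    S = len(y)
    # alpha[i, s] at time t holds P(x_t = i, tau_t = s, y_1^s | u_1^t), s = 0..S
    alpha = np.zeros((N, S + 1))
    alpha[:, 0] = pi                       # at t = 0, no output has been emitted yet
    for t in range(T):
        new = np.zeros_like(alpha)
        for i in range(N):
            if emits[i]:
                # emitting state: the alignment advances from s-1 to s
                for s in range(1, S + 1):
                    new[i, s] = B[t, i, s - 1] * (A[t, i] @ alpha[:, s - 1])
            else:
                # non-emitting state: the alignment stays at s
                for s in range(S + 1):
                    new[i, s] = A[t, i] @ alpha[:, s]
        alpha = new
    return alpha[:, S].sum()               # L = sum_i alpha(i, S, T)
```

In a full implementation, the inner sums would range only over $pred(i)$ rather than all states, and the same quantities would feed the E-step of the EM algorithm.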