Input-output HMMs for sequence processing

We consider problems of sequence processing and propose a solution based on a discrete-state model in order to represent past context. We introduce a recurrent connectionist architecture having a modular structure that associates a subnetwork with each state. The model has a statistical interpretation we call an input-output hidden Markov model (IOHMM). It can be trained by the expectation-maximization (EM) or generalized EM (GEM) algorithms, considering state trajectories as missing data, which decouples temporal credit assignment from parameter estimation. The model presents similarities to hidden Markov models (HMMs), but it allows us to map input sequences to output sequences, using the same processing style as recurrent neural networks. IOHMMs are trained using a more discriminant learning paradigm than HMMs, while potentially taking advantage of the EM algorithm. We demonstrate that IOHMMs are well suited to grammatical inference. Experimental results on the seven Tomita grammars, a standard benchmark, show that these adaptive models can attain excellent generalization.
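To make the architecture concrete, the sketch below shows the forward recursion an IOHMM likelihood computation would use. It is an illustration under simplifying assumptions, not the paper's implementation: single-layer linear-softmax "subnetworks" stand in for the per-state neural subnetworks, outputs are discrete symbols, and all names (IOHMM, forward, W_trans, W_out) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class IOHMM:
    """Minimal input-output HMM sketch: each state i owns a 'subnetwork'
    that maps the current input u_t to (a) a distribution over next states
    and (b) a distribution over discrete output symbols."""

    def __init__(self, n_states, n_inputs, n_outputs, rng=None):
        rng = rng or np.random.default_rng(0)
        # One transition subnetwork and one output subnetwork per state
        # (linear-softmax here; the paper's subnetworks are neural nets).
        self.W_trans = rng.normal(scale=0.1, size=(n_states, n_inputs, n_states))
        self.W_out = rng.normal(scale=0.1, size=(n_states, n_inputs, n_outputs))
        self.init = np.full(n_states, 1.0 / n_states)  # P(x_0), uniform

    def forward(self, inputs, outputs):
        """Return log P(y_1..T | u_1..T) via the forward recursion
        alpha_t(j) = P(y_1..t, x_t = j | u_1..t)."""
        alpha = self.init.copy()
        loglik = 0.0
        for u, y in zip(inputs, outputs):
            # Input-conditional transitions: A[i, j] = P(x_t=j | x_{t-1}=i, u_t)
            A = softmax(np.einsum('k,ikj->ij', u, self.W_trans))
            # Input-conditional emissions: b[j] = P(y_t | x_t=j, u_t)
            b = softmax(u @ self.W_out)[:, y]
            alpha = (alpha @ A) * b
            norm = alpha.sum()          # rescale to avoid underflow
            loglik += np.log(norm)
            alpha /= norm
        return loglik

# Toy usage: one-hot input vectors, binary output symbols.
model = IOHMM(n_states=3, n_inputs=4, n_outputs=2)
U = np.eye(4)[[0, 2, 1]]  # sequence of three one-hot inputs
print(model.forward(U, outputs=[0, 1, 1]))
```

An EM or GEM fit would wrap this recursion: the E-step runs a forward-backward pass to obtain posteriors over state trajectories, and the M-step (or a GEM gradient step) re-estimates each state's subnetwork weights against those posteriors. This is how treating trajectories as missing data separates temporal credit assignment from parameter estimation.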
