Time Series Modeling with Hidden Variables and Gradient-Based Algorithms

We collect time series from real-world phenomena, such as gene interactions in biology or word frequencies in consecutive news articles. These data, however, give an incomplete picture, because they result from complex dynamical processes involving unobserved state variables. Research on state-space models is motivated by the twin goals of inferring hidden state variables from observations and learning the associated dynamic and generative models. To address this problem, I have developed tractable, gradient-based methods for training Dynamic Factor Graphs (DFGs) with continuous latent variables. DFGs consist of (potentially highly nonlinear) factors modeling joint probabilities between hidden and observed variables. My hypothesis is that principled inference of hidden variables is achievable in the energy-based framework, using gradient-based optimization to find the minimum-energy state sequence given the observations. This allows for higher-order nonlinearities than graphical models do. Maximum likelihood learning is done by minimizing the expected energy over training sequences with respect to the factors’ parameters. Alternating these inference and parameter updates constitutes a deterministic EM-like procedure. Using nonlinear factors such as deep, convolutional networks, DFGs were shown to reconstruct chaotic attractors, to outperform prior methods on a time series prediction benchmark, and to successfully impute motion capture data in the presence of occlusions. In joint work with the NYU Plant Systems Biology Lab, DFGs were subsequently applied to the discovery of gene regulation networks by learning the dynamics of mRNA expression levels.

DFGs have also been extended into a deep auto-encoder architecture for time-stamped text documents, with word frequencies as inputs; here I focused on collections of documents exhibiting temporal structure. Working as dynamic topic models, DFGs could extract latent trajectories from consecutive political speeches; applied to news articles, they achieved state-of-the-art text categorization and retrieval performance. Finally, I used DFGs to evaluate the likelihood of discrete sequences of words in text corpora, relying on dynamics on word embeddings. In a collaboration with AT&T Labs Research on a speech recognition project, we improved on existing continuous statistical language models by enriching them with word features and long-range topic dependencies.
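
As a rough illustration of the alternating inference and learning procedure summarized above, the following minimal Python sketch may help. It is not the implementation used in this work, and everything in it is an assumption made for illustration: the latent state and the observation are both scalars, the dynamic factor z[t] ≈ a·z[t-1] and the observation factor y[t] ≈ c·z[t] are linear stand-ins for the nonlinear factors mentioned in the text, the energy is a plain sum of squared errors with no regularization, and the function names (energy, infer_states, update_params, fit), learning rates, and iteration counts are hypothetical.

```python
import math


def energy(z, y, a, c):
    """Quadratic energy: observation errors plus dynamical errors."""
    obs = sum((y[t] - c * z[t]) ** 2 for t in range(len(y)))
    dyn = sum((z[t] - a * z[t - 1]) ** 2 for t in range(1, len(z)))
    return 0.5 * (obs + dyn)


def infer_states(z, y, a, c, lr=0.1, n_steps=50):
    """Inference step: gradient descent on the latent sequence, parameters fixed."""
    z = list(z)
    T = len(z)
    for _ in range(n_steps):
        # Gradient of the observation term with respect to each z[t].
        grad = [-c * (y[t] - c * z[t]) for t in range(T)]
        for t in range(1, T):
            d = z[t] - a * z[t - 1]  # dynamical error at time t
            grad[t] += d             # z[t] is the target of the dynamic factor
            grad[t - 1] -= a * d     # z[t-1] is the input of the dynamic factor
        z = [z[t] - lr * grad[t] for t in range(T)]
    return z


def update_params(z, y, a, c, lr=0.01):
    """Learning step: one gradient step on the parameters, latent states fixed."""
    T = len(z)
    grad_c = -sum((y[t] - c * z[t]) * z[t] for t in range(T)) / T
    grad_a = -sum((z[t] - a * z[t - 1]) * z[t - 1] for t in range(1, T)) / T
    return a - lr * grad_a, c - lr * grad_c


def fit(y, a=0.5, c=1.0, n_epochs=20):
    """Alternate state inference and parameter updates (deterministic, EM-like)."""
    z = [0.0] * len(y)
    history = [energy(z, y, a, c)]
    for _ in range(n_epochs):
        z = infer_states(z, y, a, c)
        a, c = update_params(z, y, a, c)
        history.append(energy(z, y, a, c))
    return z, a, c, history


# Toy usage: a noiseless sine wave as the observed sequence (illustrative data).
y = [math.sin(0.3 * t) for t in range(100)]
z, a, c, history = fit(y)
print("initial energy:", history[0], "final energy:", history[-1])
```

Because the energy here is quadratic and the step sizes are small, both steps reduce the same energy, so its value does not increase from one epoch to the next. With nonlinear factors the same loop would apply, but the gradients would come from backpropagation through the factors and the choice of step size and regularization would matter considerably more.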
