Rectified Factor Networks

We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method, which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods such as autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which had so far been missed by other unsupervised methods.
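
The core algorithmic idea described above is an alternating minimization in which the factor-analysis posterior means of the hidden units are constrained to be non-negative and normalized. The following minimal NumPy sketch illustrates that idea only; it is not the paper's exact generalized alternating minimization procedure, and the variable names (W, Psi, H), the simple gradient-style M-step, and the unit-second-moment normalization are illustrative assumptions.

import numpy as np

def rfn_sketch(X, n_hidden, n_iter=50, lr=0.1, seed=0):
    # X: centered data matrix of shape (n_samples, n_visible)
    rng = np.random.default_rng(seed)
    n_samples, n_vis = X.shape
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))   # loading matrix
    Psi = np.ones(n_vis)                                  # diagonal noise variances
    for _ in range(n_iter):
        # E-step: Gaussian posterior means of the hidden units,
        # mu_h = S W^T Psi^{-1} x with S = (I + W^T Psi^{-1} W)^{-1}
        WtP = W.T / Psi                                   # W^T Psi^{-1}, shape (h, v)
        S = np.linalg.inv(np.eye(n_hidden) + WtP @ W)     # posterior covariance
        H = X @ (WtP.T @ S)                               # posterior means, (n, h)
        # Posterior regularization: rectify and normalize the means so that
        # each code unit is non-negative and has roughly unit second moment.
        H = np.maximum(H, 0.0)
        H /= np.sqrt(np.mean(H ** 2, axis=0, keepdims=True) + 1e-8)
        # M-step-like update of W and Psi from the reconstruction residuals
        R = X - H @ W.T
        W += lr * (R.T @ H) / n_samples
        Psi = np.maximum(np.mean(R ** 2, axis=0), 1e-3)
    return W, Psi, H

A call such as W, Psi, H = rfn_sketch(X, n_hidden=64) would return a sparse, non-negative code H for the centered data X; the rectification is what produces the sparsity, while the normalization keeps code units on a comparable scale.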
