Rectified Factor Networks for Biclustering

Biclustering is evolving into one of the major tools for analyzing large datasets given as a matrix of samples by features. It has been applied successfully in the life sciences for drug design and in e-commerce for recommender systems. FABIA is one of the most successful biclustering methods and is used by companies like Bayer, Janssen, or Zalando. FABIA is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated. Sample membership is also difficult to determine because the membership vectors have no exact zero entries and can take both large positive and large negative values. We propose to use the recently introduced unsupervised deep learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method, which enforces non-negative and normalized posterior means. Each code unit represents a bicluster: the samples for which the code unit is active belong to the bicluster, as do the features that have activating weights to the code unit. On 400 benchmark datasets with artificially implanted biclusters, RFN significantly outperformed 13 other biclustering competitors, including FABIA. In biclustering experiments on three gene expression datasets with known clusters that were determined by separate measurements, RFN biclustering was significantly better than the other 13 methods on two datasets and ranked second on the third. On data of the 1000 Genomes Project, RFN identified DNA segments which indicate that interbreeding with other hominins started before the ancestors of modern humans left Africa.
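
The membership rule described in the abstract (a sample belongs to a bicluster if the code unit is active for it, a feature belongs if it has an activating weight to that unit) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `extract_biclusters`, the array layout, the reading of "activating" as "strictly positive", and the `weight_threshold` parameter are all assumptions made for illustration, and the toy matrices are invented.

```python
import numpy as np


def extract_biclusters(H, W, weight_threshold=0.0):
    """Read biclusters off a non-negative code matrix and a weight matrix.

    H: array of shape (n_samples, n_units), non-negative code activations
       (standing in for rectified posterior means).
    W: array of shape (n_features, n_units), weights from code units to features.
    weight_threshold: hypothetical cutoff; a feature counts as a member
       when its weight is strictly greater than this value (an assumption).

    Returns one (sample_indices, feature_indices) pair per code unit.
    """
    H = np.asarray(H, dtype=float)
    W = np.asarray(W, dtype=float)
    if H.shape[1] != W.shape[1]:
        raise ValueError("H and W must have the same number of code units")

    biclusters = []
    for k in range(H.shape[1]):
        # Samples for which code unit k is active (exactly zero means inactive).
        samples = np.flatnonzero(H[:, k] > 0)
        # Features with an activating (here: positive) weight to unit k.
        features = np.flatnonzero(W[:, k] > weight_threshold)
        biclusters.append((samples, features))
    return biclusters


# Invented toy data: 4 samples, 5 features, 2 code units.
H_toy = np.array([[0.9, 0.0],
                  [0.4, 0.0],
                  [0.0, 1.2],
                  [0.0, 0.0]])
W_toy = np.array([[0.8, 0.0],
                  [0.5, 0.0],
                  [0.0, 0.7],
                  [0.0, 0.0],
                  [0.0, 0.6]])

toy_biclusters = extract_biclusters(H_toy, W_toy)
```

Because the codes are non-negative with exact zeros, membership reduces to a test against zero, which is the property the abstract contrasts with FABIA's membership vectors. In practice a small positive threshold on the weights may be preferable to zero; that choice is left open here.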
