Discovering Non-redundant Overlapping Biclusters on Gene Expression Data

Given a gene expression data matrix where each cell is the expression level of a gene under a certain condition, biclustering is the problem of searching for a subset of genes that co regulate and co express only under a subset of conditions. As some genes can belong to different functional categories, searching for non-redundant overlapping biclusters is an important problem in biclustering. However, most recent algorithms can only either produce disjoint biclusters or redundant biclusters with significant overlap. In other words, these algorithms do not allow users to specify the maximum overlap between the biclusters. In this paper, we propose a novel algorithm which can generate K overlapping biclusters where the maximum overlap between them is below a predefined threshold. Unlike the other approaches which often generate all biclusters at once, our algorithm produces the biclusters sequentially, where each newly generated bicluster is guaranteed to be different from the previous ones but can still overlap with them. The experiments on real datasets confirm that different meaningful overlapping biclusters are successfully discovered. Besides, under the same constraints, our algorithm returns much larger and higher-quality biclusters compared to those of the other state-of-the art algorithms.

[1]  Thomas Hofmann,et al.  Non-redundant clustering with conditional ensembles , 2005, KDD '05.

[2]  Roberto Battiti,et al.  A flexible cluster-oriented alternative clustering algorithm for choosing from the Pareto front of solutions , 2013, Machine Learning.

[3]  L. Lazzeroni Plaid models for gene expression data , 2000 .

[4]  G. Church,et al.  Systematic determination of genetic network architecture , 1999, Nature Genetics.

[5]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[6]  Arlindo L. Oliveira,et al.  Biclustering algorithms for biological data analysis: a survey , 2004, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[7]  Mauro Brunato,et al.  Reactive Search and Intelligent Optimization , 2008 .

[8]  George M. Church,et al.  Biclustering of Expression Data , 2000, ISMB.

[9]  Hans-Peter Kriegel,et al.  Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering , 2009, TKDD.

[10]  Mauro Brunato,et al.  A Repeated Local Search Algorithm for BiClustering of Gene Expression Data , 2013, SIMBAD.

[11]  Pierre Baldi,et al.  DNA Microarrays and Gene Expression - From Experiments to Data Analysis and Modeling , 2002 .

[12]  Wojtek J. Krzanowski,et al.  Improved biclustering of microarray data demonstrated through systematic performance tests , 2005, Comput. Stat. Data Anal..

[13]  Philip S. Yu,et al.  Enhanced biclustering on expression data , 2003, Third IEEE Symposium on Bioinformatics and Bioengineering, 2003. Proceedings..

[14]  Inderjit S. Dhillon,et al.  Co-clustering documents and words using bipartite spectral graph partitioning , 2001, KDD '01.

[15]  Emmanuel Müller,et al.  Detection of orthogonal concepts in subspaces of high dimensional data , 2009, CIKM.

[16]  David Botstein,et al.  GO: : TermFinder--open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes , 2004, Bioinform..

[17]  G. W. Hatfield,et al.  DNA microarrays and gene expression , 2002 .