Biclustering Protein Complex Interactions with a Biclique Finding Algorithm

Biclustering has many applications in text mining, Web clickstream mining, and bioinformatics. When data entries are binary, the tightest biclusters are bicliques. We propose a flexible and highly efficient algorithm to compute bicliques. We first generalize the Motzkin-Straus formalism for computing the maximal clique from an L1 constraint to an Lp constraint, which enables us to provide a generalized Motzkin-Straus formalism for computing maximal-edge bicliques. By adjusting parameters, the algorithm can favor biclusters with more rows and fewer columns, or vice versa, thus increasing the flexibility of the targeted biclusters. We then propose an algorithm to solve the generalized Motzkin-Straus optimization problem. The algorithm is provably convergent and has a computational complexity of O(|E|), where |E| is the number of edges. Using this algorithm, we bicluster the yeast protein complex interaction network. We find that biclustering protein complexes at the protein level does not clearly reflect the functional linkage among protein complexes in many cases, while biclustering at the protein domain level can reveal many underlying linkages. We show several new biologically significant results.
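
For readers unfamiliar with the starting point, the classical (L1-constrained) Motzkin-Straus formulation maximizes x^T A x over the probability simplex, where A is the adjacency matrix of a graph; the maximum equals 1 - 1/k, with k the size of the largest clique. The sketch below is not the generalized Lp biclique algorithm described in this abstract. It is only a minimal illustration of the classical formulation, solved with replicator dynamics (a standard multiplicative update for this problem) on a small hypothetical graph. The function name, the example graph, and the iteration count are assumptions made for illustration.

```python
def motzkin_straus_replicator(adj, iters=200):
    """Locally maximize x^T A x over the simplex using replicator dynamics.

    adj is an adjacency list (list of neighbor lists) of an undirected graph.
    Returns the final vector x and the objective value x^T A x.
    """
    n = len(adj)
    x = [1.0 / n] * n  # start at the barycenter of the simplex

    def times_a(v):
        # (A v)_i = sum of v_j over the neighbors j of vertex i
        return [sum(v[j] for j in adj[i]) for i in range(n)]

    for _ in range(iters):
        ax = times_a(x)
        f = sum(xi * axi for xi, axi in zip(x, ax))  # x^T A x
        if f == 0.0:  # graph without edges: nothing to optimize
            break
        # Multiplicative update; it keeps x nonnegative and summing to 1.
        x = [xi * axi / f for xi, axi in zip(x, ax)]

    ax = times_a(x)
    return x, sum(xi * axi for xi, axi in zip(x, ax))


# Hypothetical example: triangle {0, 1, 2} with a path 2-3-4 attached.
EXAMPLE_ADJ = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3]]

if __name__ == "__main__":
    x, f = motzkin_straus_replicator(EXAMPLE_ADJ)
    support = [i for i, xi in enumerate(x) if xi > 1e-6]
    print("support:", support)
    print("objective:", f)
    print("estimated clique size:", round(1.0 / (1.0 - f)))
```

On this example the weight concentrates on the triangle and the objective approaches 2/3 = 1 - 1/3. In general, replicator dynamics only finds a local maximizer, so the recovered clique is maximal but not guaranteed to be the largest one.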
