A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis.

We present a penalized matrix decomposition (PMD), a new framework for computing a rank-K approximation for a matrix. We approximate the matrix X as circumflexX = sigma(k=1)(K) d(k)u(k)v(k)(T), where d(k), u(k), and v(k) minimize the squared Frobenius norm of X - circumflexX, subject to penalties on u(k) and v(k). This results in a regularized version of the singular value decomposition. Of particular interest is the use of L(1)-penalties on u(k) and v(k), which yields a decomposition of X using sparse vectors. We show that when the PMD is applied using an L(1)-penalty on v(k) but not on u(k), a method for sparse principal components results. In fact, this yields an efficient algorithm for the "SCoTLASS" proposal (Jolliffe and others 2003) for obtaining sparse principal components. This method is demonstrated on a publicly available gene expression data set. We also establish connections between the SCoTLASS method for sparse principal component analysis and the method of Zou and others (2006). In addition, we show that when the PMD is applied to a cross-products matrix, it results in a method for penalized canonical correlation analysis (CCA). We apply this penalized CCA method to simulated data and to a genomic data set consisting of gene expression and DNA copy number measurements on the same set of samples.

[1]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[2]  C. Eckart,et al.  The approximation of one matrix by another of lower rank , 1936 .

[3]  J. N. R. Jeffers,et al.  Two Case Studies in the Application of Principal Component Analysis , 1967 .

[4]  S. Wold Cross-Validatory Estimation of the Number of Components in Factor and Principal Components Models , 1978 .

[5]  H. Sebastian Seung,et al.  Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[6]  Hideo Matsuda,et al.  Quick selection of representative protein chain sets based on customizable requirements , 2000, Bioinform..

[7]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[8]  L. Lazzeroni Plaid models for gene expression data , 2000 .

[9]  John D. Storey,et al.  Empirical Bayes Analysis of a Microarray Experiment , 2001 .

[10]  Russ B. Altman,et al.  Missing value estimation methods for DNA microarrays , 2001, Bioinform..

[11]  Patrik O. Hoyer,et al.  Non-negative sparse coding , 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[12]  M. Ringnér,et al.  Impact of DNA amplification on gene expression patterns in breast cancer. , 2002, Cancer research.

[13]  Christian A. Rees,et al.  Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[14]  S. Dudoit,et al.  Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data , 2002 .

[15]  I. Jolliffe,et al.  A Modified Principal Component Technique Based on the LASSO , 2003 .

[16]  Trevor Hastie,et al.  Class Prediction by Nearest Shrunken Centroids, with Applications to DNA Microarrays , 2003 .

[17]  C. Molony,et al.  Genetic analysis of genome-wide variation in human gene expression , 2004, Nature.

[18]  Patrik O. Hoyer,et al.  Non-negative Matrix Factorization with Sparseness Constraints , 2004, J. Mach. Learn. Res..

[19]  S. Hunt,et al.  Genome-Wide Associations of Gene Expression Variation in Humans , 2005, PLoS genetics.

[20]  R. Tibshirani,et al.  Sparsity and smoothness via the fused lasso , 2005 .

[21]  R. Tibshirani,et al.  Sparse Principal Component Analysis , 2006 .

[22]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[23]  Ajay N. Jain,et al.  Genomic and transcriptional aberrations linked to breast cancer pathophysiologies. , 2006, Cancer cell.

[24]  Ian T. Jolliffe,et al.  Projected gradient approach to the numerical solution of the SCoTLASS , 2006, Comput. Stat. Data Anal..

[25]  David Tritchler,et al.  Genome-wide sparse canonical correlation of gene expression with genotypes , 2007, BMC proceedings.

[26]  R. Redon,et al.  Relative Impact of Nucleotide and Copy Number Variation on Gene Expression Phenotypes , 2007, Science.

[27]  R. Tibshirani,et al.  PATHWISE COORDINATE OPTIMIZATION , 2007, 0708.1485.

[28]  Jianhua Z. Huang,et al.  Sparse principal component analysis via regularized low rank matrix approximation , 2008 .

[29]  R. Tibshirani,et al.  Spatial smoothing and hot spot detection for CGH data using the fused lasso. , 2008, Biostatistics.

[30]  A. Zwinderman,et al.  Statistical Applications in Genetics and Molecular Biology Quantifying the Association between Gene Expressions and DNA-Markers by Penalized Canonical Correlation Analysis , 2011 .

[31]  Alfred O. Hero,et al.  A greedy approach to sparse canonical correlation analysis , 2008, 0801.2748.

[32]  Patrick O. Perry,et al.  Bi-cross-validation of the SVD and the nonnegative matrix factorization , 2009, 0908.2062.

[33]  Daniela M Witten,et al.  Extensions of Sparse Canonical Correlation Analysis with Applications to Genomic Data , 2009, Statistical applications in genetics and molecular biology.

[34]  Holger Hoefling A Path Algorithm for the Fused Lasso Signal Approximator , 2009, 0910.0526.

[35]  D. Tritchler,et al.  Sparse Canonical Correlation Analysis with Application to Genomic Data Integration , 2009, Statistical applications in genetics and molecular biology.