Sparse Principal Component Analysis

Principal component analysis (PCA) is widely used in data processing and dimensionality reduction. However, each principal component is a linear combination of all the original variables, so the results are often difficult to interpret. We introduce a new method called sparse principal component analysis (SPCA), which uses the lasso (elastic net) to produce modified principal components with sparse loadings. We first show that PCA can be formulated as a regression-type optimization problem; sparse loadings are then obtained by imposing the lasso (elastic net) constraint on the regression coefficients. Efficient algorithms are proposed to fit our SPCA models for both regular multivariate data and gene expression arrays. We also give a new formula to compute the total variance of modified principal components. As illustrations, SPCA is applied to real and simulated data with encouraging results.
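The regression view of PCA can be illustrated with a minimal sketch. It is not the fitting algorithm proposed in the paper: it handles only the first component, keeps the ordinary PC scores fixed as the regression response, and uses a plain coordinate-descent solver. The function names, the penalty parameters `lam` (ridge) and `lam1` (lasso), and the iteration count are all assumptions made for illustration.

```python
import numpy as np


def first_pc_via_ridge(X, lam=1.0):
    """Ridge-regress the first PC scores on X and normalize the coefficients.

    Returns (normalized ridge coefficients, ordinary first PCA loading vector).
    """
    X = X - X.mean(axis=0)                      # center the columns
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    z = U[:, 0] * s[0]                          # first principal component scores
    p = X.shape[1]
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ z)
    return beta / np.linalg.norm(beta), Vt[0]


def sparse_first_loading(X, lam=1.0, lam1=0.0, n_iter=200):
    """Elastic-net regression of the first PC scores on X (illustrative only).

    Minimizes ||z - X b||^2 + lam * ||b||^2 + lam1 * ||b||_1 by cyclic
    coordinate descent, then rescales b to unit length.
    """
    X = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    z = U[:, 0] * s[0]
    p = X.shape[1]
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)               # squared norm of each column
    resid = z.copy()                            # residual z - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * beta[j]          # drop variable j from the fit
            rho = X[:, j] @ resid
            # soft-threshold at lam1 / 2, then apply the ridge shrinkage
            beta[j] = np.sign(rho) * max(abs(rho) - lam1 / 2, 0.0) / (col_sq[j] + lam)
            resid -= X[:, j] * beta[j]          # put variable j back
    norm = np.linalg.norm(beta)
    return beta / norm if norm > 0 else beta
```

With `lam1 = 0` the penalty is pure ridge, and the normalized coefficients coincide with the ordinary first loading vector. Raising `lam1` sets the loadings of weakly contributing variables exactly to zero, which is the sparsity the abstract describes.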
