Selecting the number of principal components: estimation of the true rank of a noisy matrix

Principal component analysis (PCA) is a well-known tool in multivariate statistics. One significant challenge in using PCA is the choice of the number of components. To address this challenge, we propose an exact distribution-based method for hypothesis testing and for constructing confidence intervals for signals in a noisy matrix. Assuming Gaussian noise, we use the conditional distribution of the singular values of a Wishart matrix to derive exact hypothesis tests and confidence intervals for the true signals. Our approach builds on that of Taylor, Loftus and Tibshirani (2013) for testing the global null: we generalize it to test for any number of principal components and derive an integrated version with greater power. In simulation studies, our proposed methods compare well with existing approaches.
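The abstract describes the workflow only at a high level (a low-rank signal observed in Gaussian noise, with inference carried out on the singular values); the exact conditional Wishart distributions derived in the paper are not reproduced here. As a rough, non-authoritative illustration of the rank-selection task, the sketch below compares each observed singular value to a Monte Carlo null of pure Gaussian noise and stops at the first non-rejection. The function name `sequential_rank_test`, the known noise level `sigma`, and the Monte Carlo calibration are assumptions for illustration; this is not the paper's exact test, which additionally conditions on the larger singular values.

```python
import numpy as np

def sequential_rank_test(X, sigma=1.0, alpha=0.05, n_sim=500, seed=None):
    """Estimate the rank of the signal in X = signal + sigma * noise.

    For k = 1, 2, ..., compare the k-th observed singular value with the
    k-th singular value of simulated pure-noise matrices, and stop at the
    first non-rejection (illustrative sketch, not the paper's exact test).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)  # descending singular values

    # Monte Carlo null reference: singular values of sigma * Gaussian noise.
    null_sv = np.array([
        np.linalg.svd(sigma * rng.standard_normal((n, p)), compute_uv=False)
        for _ in range(n_sim)
    ])

    for k in range(min(n, p)):
        # Monte Carlo p-value for the k-th singular value under pure noise.
        pval = (1 + np.sum(null_sv[:, k] >= sv[k])) / (1 + n_sim)
        if pval > alpha:
            return k  # fail to reject: estimated rank is k
    return min(n, p)

# Example: a rank-2 signal buried in unit-variance Gaussian noise.
rng = np.random.default_rng(0)
X = (rng.standard_normal((50, 2)) @ rng.standard_normal((2, 40))
     + rng.standard_normal((50, 40)))
print(sequential_rank_test(X, sigma=1.0, seed=1))  # typically prints 2
```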

[1] A. James, Distributions of Matrix Variates and Latent Roots Derived from Normal Samples, 1964.

[2] Min-Te Chao, et al., The Exact Distribution of Bartlett's Test Statistic for Homogeneity of Variances with Unequal Sample Sizes, 1978.

[3] Robb J. Muirhead, et al., Latent Roots and Matrix Variates: A Review of Some Asymptotic Results, 1978.

[4] R. Muirhead, Aspects of Multivariate Statistical Theory, 1982, Wiley Series in Probability and Statistics.

[5] K. I. Gross, et al., Total positivity, spherical series, and hypergeometric functions of matrix argument, 1989.

[6] Donald A. Jackson, Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches, 1993.

[7] Y. Benjamini, et al., Controlling the false discovery rate: a practical and powerful approach to multiple testing, 1995.

[8] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, 1996.

[9] I. Johnstone, On the distribution of the largest eigenvalue in principal components analysis, 2001.

[10] R. Tibshirani, et al., Least angle regression, 2004, arXiv:math/0406456.

[11] J. W. Silverstein, et al., Eigenvalues of large sample covariance matrices of spiked population models, 2004, arXiv:math/0408165.

[12] R. Bro, et al., Cross-validation of component models: A critical look at current methods, 2008, Analytical and Bioanalytical Chemistry.

[13] B. Nadler, et al., Determining the number of components in a factor model from limited noisy data, 2008.

[14] Stéphane Dray, et al., On the number of principal components: A test of dimensionality based on measurements of similarity between matrices, 2008, Comput. Stat. Data Anal.

[15] Patrick O. Perry, et al., Bi-cross-validation of the SVD and the nonnegative matrix factorization, 2009, arXiv:0908.2062.

[16] B. Nadler, Finite sample approximation results for principal component analysis: a matrix perturbation approach, 2009, arXiv:0901.3245.

[17] Boaz Nadler, et al., Non-Parametric Detection of the Number of Signals: Hypothesis Testing and Random Matrix Theory, 2009, IEEE Transactions on Signal Processing.

[18] Raj Rao Nadakuditi, et al., The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices, 2009, arXiv:0910.2120.

[19] Emmanuel J. Candès, et al., A Singular Value Thresholding Algorithm for Matrix Completion, 2008, SIAM J. Optim.

[20] Robert Tibshirani, et al., Spectral Regularization Algorithms for Learning Large Incomplete Matrices, 2010, J. Mach. Learn. Res.

[21] Ian T. Jolliffe, Principal Component Analysis, 2002, International Encyclopedia of Statistical Science.

[22] Julie Josse, et al., Selecting the number of components in principal component analysis using cross-validation approximations, 2012, Comput. Stat. Data Anal.

[23] Jianfeng Yao, et al., On sample eigenvalues in a generalized spiked population model, 2008, J. Multivar. Anal.

[24] Joshua R. Loftus, et al., Inference in adaptive regression via the Kac–Rice formula, 2013, arXiv:1308.3020.

[25] Zongming Ma, Sparse Principal Component Analysis and Iterative Thresholding, 2011, arXiv:1112.2432.

[26] Alexandra Chouldechova, et al., False Discovery Rate Control for Sequential Selection Procedures, with Application to the Lasso, 2013.

[27] R. Tibshirani, et al., Sequential selection procedures and false discovery rate control, 2013, arXiv:1309.5352.

[28] R. Tibshirani, et al., A Study of Error Variance Estimation in Lasso Regression, 2013, arXiv:1311.5274.

[29] R. Tibshirani, et al., A Significance Test for the Lasso, 2013, Annals of Statistics.

[30] D. Donoho, et al., Minimax risk of matrix denoising by singular value thresholding, 2013, arXiv:1304.2085.

[31] Robert Tibshirani, et al., Post-selection adaptive inference for Least Angle Regression and the Lasso, 2014.

[32] David L. Donoho, et al., The Optimal Hard Threshold for Singular Values is $4/\sqrt{3}$, 2013, IEEE Transactions on Information Theory.