Prediction by Supervised Principal Components

In regression problems where the number of predictors greatly exceeds the number of observations, conventional regression techniques may produce unsatisfactory results. We describe a technique called supervised principal components that can be applied to this type of problem. Supervised principal components is similar to conventional principal components analysis except that it uses a subset of the predictors selected based on their association with the outcome. Supervised principal components can be applied to regression and generalized regression problems, such as survival analysis. It compares favorably to other techniques for this type of problem, and can also account for the effects of other covariates and help identify which predictor variables are most important. We also provide asymptotic consistency results to help support our empirical findings. These methods could become important tools for DNA microarray data, where they may be used to more accurately diagnose and treat cancer.
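The idea described above (select the predictors most associated with the outcome, then run principal components on that subset only) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the association score used here is the absolute univariate correlation with the outcome, the cutoff `threshold` is a hypothetical tuning parameter that would normally be chosen by cross-validation, and the function names `supervised_pc_fit` and `supervised_pc_predict` are made up for this example. It covers only the plain regression case, not survival outcomes.

```python
import numpy as np


def supervised_pc_fit(X, y, threshold, n_components=1):
    """Illustrative supervised principal components fit (assumed scoring rule).

    X: (n_samples, n_features) predictor matrix; y: (n_samples,) outcome.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # Step 1: score each predictor by its association with the outcome.
    # Assumption: absolute univariate correlation is used as the score.
    norms = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    scores = np.abs(Xc.T @ yc) / np.where(norms > 0, norms, np.inf)

    # Step 2: keep only predictors whose score exceeds the threshold.
    keep = np.flatnonzero(scores > threshold)
    if keep.size == 0:
        raise ValueError("threshold removed every predictor")

    # Step 3: conventional principal components on the reduced matrix (via SVD).
    _, _, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    loadings = Vt[:n_components].T            # (n_kept, n_components)
    Z = Xc[:, keep] @ loadings                # component scores

    # Step 4: least-squares regression of the centered outcome on the components.
    beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)

    return {
        "x_mean": x_mean,
        "y_mean": y_mean,
        "keep": keep,
        "loadings": loadings,
        "beta": beta,
    }


def supervised_pc_predict(model, X_new):
    """Project new data onto the fitted components and predict the outcome."""
    Z = (X_new - model["x_mean"])[:, model["keep"]] @ model["loadings"]
    return model["y_mean"] + Z @ model["beta"]
```

Screening before the decomposition is what distinguishes this from ordinary principal components regression: the leading component is computed only from predictors that appear related to the outcome, so large sources of variation that are unrelated to it are less likely to dominate.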
