Inference with transposable data: modelling the effects of row and column correlations

Summary. We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable, meaning that neither the row variables nor the column variables can be considered independent instances. An example is detecting significant genes in microarrays when the samples may be dependent because of latent variables or unknown batch effects. By modelling such matrix data with the matrix variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the many problems caused by unexpected correlations: we simultaneously estimate the row and column covariances and use them to sphere, or decorrelate, the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns, so that test statistics more closely follow their null distributions and multiple-testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: increased statistical power, less bias in estimating the false discovery rate and reduced variance of the false discovery rate estimators.
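
As a rough illustration of the sphering step only, the following Python sketch draws noise with separable row and column covariances and decorrelates it by multiplying on each side by an inverse square root of the corresponding covariance. It is a minimal sketch under stated assumptions, not the estimation procedure of the paper: the covariances are taken as known rather than estimated, and the names `inv_sqrt`, `sphere` and `ar1_cov`, the AR(1) covariance structure and the matrix dimensions are all hypothetical choices made for the example.

```python
import numpy as np


def inv_sqrt(cov):
    """Symmetric inverse square root of a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    # Scale each eigenvector column by 1/sqrt(eigenvalue), then recombine.
    return (vecs / np.sqrt(vals)) @ vecs.T


def sphere(X, row_cov, col_cov):
    """Decorrelate rows and columns of X given row and column covariances."""
    return inv_sqrt(row_cov) @ X @ inv_sqrt(col_cov)


def ar1_cov(n, rho):
    """AR(1) covariance, used here only as a convenient toy structure."""
    idx = np.arange(n)
    return rho ** np.abs(idx[:, None] - idx[None, :])


rng = np.random.default_rng(0)
n_rows, n_cols = 200, 20           # e.g. genes by samples (illustrative sizes)
row_cov = ar1_cov(n_rows, 0.5)     # assumed known; the paper estimates these
col_cov = ar1_cov(n_cols, 0.8)

# Matrix variate normal noise: X = A Z B^T with A A^T = row_cov, B B^T = col_cov.
Z = rng.standard_normal((n_rows, n_cols))
X = np.linalg.cholesky(row_cov) @ Z @ np.linalg.cholesky(col_cov).T

X_sphered = sphere(X, row_cov, col_cov)

# Average correlation between adjacent columns, before and after sphering.
before = np.corrcoef(X, rowvar=False)
after = np.corrcoef(X_sphered, rowvar=False)
print("mean adjacent-column correlation before:", np.diag(before, 1).mean())
print("mean adjacent-column correlation after: ", np.diag(after, 1).mean())
```

With the true covariances the sphered matrix is an orthogonal transformation of independent standard normal noise on each side, so its entries are again independent. In practice the covariances would have to be estimated from the data, and the decorrelation would then hold only approximately.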
