Main Effects and Interactions in Mixed and Incomplete Data Frames

Abstract: A mixed data frame (MDF) is a table collecting categorical, numerical, and count observations. MDFs are widely used in statistics, with applications ranging from abundance data in ecology to recommender systems. In many cases, an MDF simultaneously exhibits main effects, such as row, column, or group effects, and interactions, for which a low-rank model has often been suggested. Although the literature on low-rank approximations is substantial, with few exceptions existing methods do not allow main effects and interactions to be incorporated while providing statistical guarantees. The present work fills this gap. We propose an estimation method that simultaneously recovers the main effects and the interactions. We show that our method is near optimal under conditions that are met in our targeted applications. We also propose an optimization algorithm that provably converges to an optimal solution. Numerical experiments reveal that our method, mimi, performs well when the main effects are sparse and the interaction matrix has low rank. We also show that mimi compares favorably to existing methods, in particular when the main effects are large compared to the interactions and when the proportion of missing entries is large. The method is available as an R package on the Comprehensive R Archive Network. Supplementary materials for this article are available online.
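
To make the setting concrete, the following minimal Python sketch simulates an incomplete mixed data frame whose parameter matrix is the sum of sparse row and column main effects and a low-rank interaction matrix. It illustrates only the data structure described in the abstract, not the mimi estimator itself. The dimensions, effect sizes, link functions (identity, log, logit), and missingness rate are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rank = 30, 8, 2  # hypothetical sizes, chosen for illustration

# Sparse main effects: only a few rows and columns carry a nonzero effect.
row_effect = np.zeros(n)
row_effect[:3] = [2.0, -1.5, 1.0]
col_effect = np.zeros(p)
col_effect[:2] = [1.0, -2.0]
main = row_effect[:, None] + col_effect[None, :]

# Low-rank interactions: product of two thin Gaussian factors.
interaction = 0.3 * rng.normal(size=(n, rank)) @ rng.normal(size=(rank, p))

# Parameter matrix = main effects + interactions.
theta = main + interaction

# Mixed columns (assumed layout): 4 numerical, 2 count, 2 binary.
X = np.empty((n, p))
X[:, :4] = theta[:, :4] + rng.normal(scale=0.5, size=(n, 4))   # Gaussian, identity link
X[:, 4:6] = rng.poisson(np.exp(theta[:, 4:6]))                 # Poisson, log link
X[:, 6:] = rng.binomial(1, 1.0 / (1.0 + np.exp(-theta[:, 6:])))  # Bernoulli, logit link

# Remove about 30% of the entries completely at random.
mask = rng.random((n, p)) < 0.3
X[mask] = np.nan

print("interaction rank:", np.linalg.matrix_rank(interaction))
print("fraction missing:", np.isnan(X).mean())
```

An estimator for this setting would take `X` with its missing entries as input and aim to recover both the sparse effects in `main` and the low-rank matrix `interaction`.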
