Joint covariate selection and joint subspace selection for multiple classification problems

We address the problem of recovering a common set of covariates that are relevant simultaneously to several classification problems. By penalizing the sum of ℓ2 norms of the blocks of coefficients associated with each covariate across the different classification problems, we encourage similar sparsity patterns in all models. To take computational advantage of the sparsity of solutions at high regularization levels, we propose a blockwise path-following scheme that approximately traces the regularization path. As the regularization coefficient decreases, the algorithm maintains and updates a growing set of covariates that are simultaneously active for all problems. We also show how random projections can be used to extend this approach to the problem of joint subspace selection, in which multiple predictors are found within a common low-dimensional subspace. We present theoretical results showing that this random-projection approach converges to the solution yielded by trace-norm regularization. Finally, we present a variety of experimental results exploring joint covariate selection and joint subspace selection, comparing the path-following approach to competing algorithms in terms of prediction accuracy and running time.
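The penalty described above is the ℓ1/ℓ2 block norm, a group-lasso penalty in which each group collects one covariate's coefficients across all tasks. The NumPy sketch below is an illustration under simplifying assumptions, not the paper's algorithm: a single shared design matrix, a squared loss in place of the classification losses, plain proximal gradient descent rather than the blockwise path-following scheme, and hypothetical names such as prox_block_l1l2. It shows the penalty, its blockwise soft-thresholding proximal operator, and how entire rows of the coefficient matrix are driven to zero at a single regularization level:

```python
import numpy as np

def block_l1l2_penalty(W):
    """Sum of l2 norms of the rows of W (covariates x tasks).

    Row j collects covariate j's coefficients across all tasks, so the
    penalty zeroes out entire rows, yielding joint covariate selection.
    """
    return np.sum(np.linalg.norm(W, axis=1))

def prox_block_l1l2(W, t):
    """Blockwise soft-thresholding: proximal operator of t * sum_j ||W[j, :]||_2."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12)) * W

# Toy problem: all tasks share one design matrix and a squared loss
# (the paper treats classification losses; squared loss keeps this short).
rng = np.random.default_rng(0)
n, d, k = 100, 20, 3                        # samples, covariates, tasks
X = rng.standard_normal((n, d))
W_true = np.zeros((d, k))
W_true[:4] = rng.standard_normal((4, k))    # 4 covariates active in every task
Y = X @ W_true + 0.1 * rng.standard_normal((n, k))

lam = 0.5                                   # one point on the regularization path
step = n / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the loss
W = np.zeros((d, k))
for _ in range(500):
    grad = X.T @ (X @ W - Y) / n            # gradient of the averaged squared loss
    W = prox_block_l1l2(W - step * grad, step * lam)

print("jointly selected covariates:",
      np.flatnonzero(np.linalg.norm(W, axis=1) > 1e-8))
```

Running the sketch prints the indices of the covariates that remain jointly active, typically the four rows planted in W_true. For the joint subspace part, the stated target of the random-projection scheme is trace-norm (nuclear-norm) regularization, whose proximal operator is singular-value soft-thresholding; the following is a generic sketch of that target penalty, not the paper's random-projection algorithm:

```python
def trace_norm(W):
    """Trace (nuclear) norm of W: the sum of its singular values."""
    return np.linalg.svd(W, compute_uv=False).sum()

def prox_trace_norm(W, t):
    """Singular-value soft-thresholding: proximal operator of t * ||W||_*.

    Shrinking singular values toward zero favors low-rank W, i.e. task
    predictors confined to a common low-dimensional subspace.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - t, 0.0)) @ Vt
```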
