Tail bounds for volume sampled linear regression

The $n \times d$ design matrix in a linear regression problem is given, but the response for each point is hidden unless explicitly requested. The goal is to observe only a small number $k \ll n$ of the responses and then produce a weight vector whose total squared loss over all $n$ points is at most $1+\epsilon$ times the minimum. A standard approach to this problem is i.i.d. leverage score sampling, but this method is known to perform poorly when $k$ is small (e.g., $k = d$); in that regime it is dominated by volume sampling, a joint sampling method that explicitly promotes diversity. How the two methods compare for larger $k$ was not previously understood. We prove that volume sampling can exhibit poor behavior for large $k$: indeed, worse than leverage score sampling. We also show how to repair volume sampling using a new padding technique. We prove that padded volume sampling has a tail bound at least as good as that of leverage score sampling: sample size $k = O(d\log d + d/\epsilon)$ suffices to guarantee, with high probability, total loss at most $1+\epsilon$ times the minimum. The main technical challenge is proving tail bounds for the sums of dependent random matrices that arise from volume sampling.
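For concreteness, here is a minimal NumPy sketch (not taken from the paper) contrasting the two sampling schemes on a toy regression problem: i.i.d. leverage score sampling with the usual importance reweighting, and a naive reverse-iterative implementation of size-$k$ volume sampling, which removes rows one at a time with probability proportional to the determinant of the covariance of the remaining rows. The padding technique introduced in the paper is not shown; all function names and the experimental setup are illustrative assumptions.

```python
import numpy as np

def leverage_scores(X):
    # Leverage score of row i: squared norm of the i-th row of U,
    # where X = U S V^T is a thin SVD (equivalently x_i^T (X^T X)^+ x_i).
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return np.sum(U ** 2, axis=1)

def volume_sample(X, k, rng):
    # Reverse iterative volume sampling: start from all n rows and repeatedly
    # remove one row with probability proportional to the determinant of the
    # remaining covariance, i.e. proportional to one minus that row's leverage
    # within the current subset, until k rows remain (assumes k >= d).
    # Naive O((n-k) n d^2) recomputation, for illustration only.
    S = list(range(X.shape[0]))
    while len(S) > k:
        Xs = X[S]
        lev = np.einsum('ij,jk,ik->i', Xs, np.linalg.pinv(Xs.T @ Xs), Xs)
        p = np.clip(1.0 - lev, 0.0, None)
        S.pop(rng.choice(len(S), p=p / p.sum()))
    return np.array(S)

def subsampled_fit(X, y, idx, w=None):
    # Least squares using only the k queried responses y[idx];
    # optional row weights w implement importance reweighting.
    Xs, ys = X[idx], y[idx]
    if w is not None:
        Xs, ys = w[:, None] * Xs, w * ys
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, k = 500, 5, 25
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
    loss = lambda b: np.sum((X @ b - y) ** 2)
    best = loss(np.linalg.lstsq(X, y, rcond=None)[0])

    # i.i.d. leverage score sampling with the usual 1/sqrt(k p_i) reweighting.
    p = leverage_scores(X)
    p /= p.sum()
    idx_lev = rng.choice(n, size=k, replace=True, p=p)
    w_lev = 1.0 / np.sqrt(k * p[idx_lev])

    idx_vol = volume_sample(X, k, rng)  # unweighted fit on the joint sample

    print("leverage:", loss(subsampled_fit(X, y, idx_lev, w_lev)) / best)
    print("volume:  ", loss(subsampled_fit(X, y, idx_vol)) / best)
```

Both printed ratios should be close to 1 for moderate $k$; the quantity of interest in the abstract is how large $k$ must be before such a ratio of at most $1+\epsilon$ holds with high probability.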
