Fast DPP Sampling for Nyström with Application to Kernel Methods

The Nyström method has long been popular for scaling up kernel methods. Its theoretical guarantees and empirical performance depend critically on the quality of the selected landmarks. We study landmark selection for Nyström using Determinantal Point Processes (DPPs), discrete probability models that allow tractable generation of diverse samples. We prove that landmarks selected via DPPs guarantee bounds on approximation errors, and we then analyze the implications for kernel ridge regression. Contrary to prior reservations stemming from the cubic complexity of DPP sampling, we show that (under certain conditions) Markov chain DPP sampling requires only linear time in the size of the data. We present several empirical results that support our theoretical analysis and demonstrate the superior performance of DPP-based landmark selection compared with existing approaches.
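
As a rough illustration of the idea summarized above, the following minimal sketch selects k landmarks with a generic Metropolis swap chain whose stationary distribution is proportional to det(K_S) (a k-DPP), then forms the Nyström approximation from those landmarks. It is not taken from the paper: the function names (`rbf_kernel`, `kdpp_swap_chain`, `nystrom`), the RBF kernel, the toy data, and all parameter values are assumptions made for illustration. It also recomputes each determinant from scratch, so it does not achieve the linear-time behavior the abstract describes.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Dense RBF (Gaussian) kernel matrix; the kernel choice is an assumption."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kdpp_swap_chain(K, k, n_steps, rng):
    """Metropolis swap chain targeting P(S) proportional to det(K[S, S]), |S| = k.

    Illustrative only: each step recomputes a k x k determinant (O(k^3)),
    whereas efficient samplers update determinants incrementally.
    """
    n = K.shape[0]
    S = list(rng.choice(n, size=k, replace=False))
    _, logdet = np.linalg.slogdet(K[np.ix_(S, S)])
    for _ in range(n_steps):
        i = rng.integers(k)          # position in S to swap out
        j = rng.integers(n)          # candidate index to swap in
        if j in S:
            continue                 # proposal would not change the set size
        T = S.copy()
        T[i] = j
        sign, new_logdet = np.linalg.slogdet(K[np.ix_(T, T)])
        if sign <= 0:
            continue                 # numerically singular subset: reject
        # Symmetric proposal, so accept with probability min(1, det_T / det_S).
        if np.log(rng.random()) < new_logdet - logdet:
            S, logdet = T, new_logdet
    return np.array(S)

def nystrom(K, S):
    """Nystrom approximation K[:, S] @ pinv(K[S, S]) @ K[S, :]."""
    C = K[:, S]
    W = K[np.ix_(S, S)]
    return C @ np.linalg.pinv(W) @ C.T

# Toy usage with hypothetical sizes and a fixed seed.
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3))
K = rbf_kernel(X, gamma=1.0)
S = kdpp_swap_chain(K, k=10, n_steps=500, rng=rng)
K_approx = nystrom(K, S)
rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print("landmarks:", sorted(S.tolist()))
print("relative Frobenius error:", rel_err)
```

Swapping the landmark selection for uniform sampling (for example `rng.choice(60, size=10, replace=False)`) and comparing `rel_err` gives a quick, informal feel for the effect of landmark quality, though a single toy run says nothing about the guarantees stated above.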
