Alternating minimization for dictionary learning with random initialization

We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem. The dictionary learning problem is to factorize vector samples $y^{1},y^{2},\ldots, y^{n}$ into an appropriate basis (dictionary) $A^*$ and sparse vectors $x^{1*},\ldots,x^{n*}$. Our algorithm is a simple alternating minimization procedure that switches between $\ell_1$ minimization and gradient descent in alternate steps. Dictionary learning, and alternating minimization algorithms for it in particular, are well studied both theoretically and empirically. However, in contrast to previous theoretical analyses of this problem, we replace the condition on the operator norm (the largest singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (the largest entry in absolute value). This not only yields convergence rates for the error of the estimated dictionary measured in the matrix infinity norm, but also guarantees that a random initialization converges to the global optimum. Our guarantees hold under a reasonable generative model that allows for dictionaries with growing operator norms, handle an arbitrary level of overcompleteness, and permit sparsity levels that are information-theoretically optimal. We also establish upper bounds on the sample complexity of our algorithm.
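To make the alternating scheme concrete, the following is a minimal sketch (NumPy only) of an alternating minimization loop of this flavor: an $\ell_1$-regularized sparse coding step, solved here by ISTA, alternating with one gradient-descent step on the dictionary, starting from a random initialization. The choice of ISTA, the column renormalization, and all hyperparameters (lam, step, iteration counts) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code_ista(A, Y, lam, n_iter=200):
    """Approximate l1-regularized sparse coding:
    min_X 0.5*||Y - A X||_F^2 + lam*||X||_1, solved by ISTA with step 1/L."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the smooth part
    X = np.zeros((A.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = A.T @ (A @ X - Y)
        X = soft_threshold(X - grad / L, lam / L)
    return X

def dictionary_gradient_step(A, Y, X, step):
    """One gradient-descent step on the dictionary for 0.5*||Y - A X||_F^2,
    followed by column renormalization (a common convention, assumed here)."""
    grad = (A @ X - Y) @ X.T
    A = A - step * grad
    norms = np.maximum(np.linalg.norm(A, axis=0), 1e-12)
    return A / norms

def alternating_minimization(Y, d, lam=0.1, step=1e-2, n_outer=50, seed=0):
    """Random initialization, then alternate l1 sparse coding and a dictionary
    gradient step; hyperparameter values are illustrative only."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((Y.shape[0], d))
    A /= np.linalg.norm(A, axis=0)          # unit-norm random columns
    X = None
    for _ in range(n_outer):
        X = sparse_code_ista(A, Y, lam)     # l1 minimization step
        A = dictionary_gradient_step(A, Y, X, step)  # gradient descent step
    return A, X
```

On synthetic data generated as $Y = A^* X^*$ with sparse $X^*$, the returned dictionary can be compared to $A^*$ entrywise, which matches the matrix-infinity-norm error measure discussed above (up to the usual column sign and permutation ambiguities).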
