Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent

For large scale learning problems, it is desirable to obtain the optimal model parameters by going through the data in only one pass. Polyak and Juditsky (1992) showed that, asymptotically, the test performance of the simple average of the parameters obtained by stochastic gradient descent (SGD) is as good as that of the parameters that minimize the empirical cost. However, to our knowledge, despite its optimal asymptotic convergence rate, averaged SGD (ASGD) has received little attention in recent research on large scale learning. One possible reason is that, for most real problems, ASGD may require a prohibitively large number of training samples to reach its asymptotic region. In this paper, we present a finite sample analysis of the method of Polyak and Juditsky (1992). Our analysis shows that, with an improperly chosen learning rate, ASGD indeed typically requires a huge number of samples to reach its asymptotic region. More importantly, based on our analysis, we propose a simple way to set the learning rate so that ASGD reaches its asymptotic region with a reasonable amount of data. We compare ASGD with the proposed learning rate against other well-known algorithms for training large scale linear classifiers. The experiments clearly show the superiority of ASGD.
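To make the averaging scheme the abstract refers to concrete, the sketch below shows a minimal one-pass ASGD (Polyak-Ruppert averaging) loop for logistic regression. It is an illustration only: the step-size schedule eta_t = eta0 / (1 + decay*t)**power and the parameter values are assumed, commonly used choices, not the specific learning rate proposed in this paper.

```python
import numpy as np

def asgd_logistic(X, y, eta0=0.1, decay=1e-4, power=0.75):
    """One-pass averaged SGD (Polyak-Ruppert averaging) for logistic regression.

    Note: the schedule eta_t = eta0 / (1 + decay*t)**power is an assumed,
    commonly used choice; it is not necessarily the schedule proposed in the paper.
    """
    n, d = X.shape
    w = np.zeros(d)       # current SGD iterate
    w_bar = np.zeros(d)   # running average of the iterates (returned model)
    for t in range(n):
        x_t, y_t = X[t], y[t]                        # labels y_t in {-1, +1}
        eta_t = eta0 / (1.0 + decay * t) ** power
        margin = y_t * float(x_t @ w)
        grad = -y_t * x_t / (1.0 + np.exp(margin))   # gradient of log(1 + exp(-y w.x))
        w = w - eta_t * grad                         # plain SGD update
        w_bar += (w - w_bar) / (t + 1)               # incremental average of w_1..w_t
    return w_bar
```

A typical use would be `w = asgd_logistic(X_train, y_train)` followed by predictions `np.sign(X_test @ w)`; only the averaged vector `w_bar` is kept as the final model, while the raw iterate `w` is discarded.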

[1] V. Fabian, On Asymptotic Normality in Stochastic Approximation, 1968.

[2] V. Fabian, Asymptotically Efficient Stochastic Approximation; The RM Case, 1973.

[3] Boris Polyak, et al., Acceleration of stochastic approximation by averaging, 1992.

[4] Yoshua Bengio, et al., Gradient-based learning applied to document recognition, 1998, Proc. IEEE.

[5] Kenji Fukumizu, et al., Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons, 2000, Neural Computation.

[6] Tong Zhang, et al., Solving large scale linear prediction problems using stochastic gradient descent algorithms, 2004, ICML.

[7] Yiming Yang, et al., RCV1: A New Benchmark Collection for Text Categorization Research, 2004, J. Mach. Learn. Res.

[8] Léon Bottou, et al., On-line learning for very large data sets, 2005.

[9] Léon Bottou, et al., The Tradeoffs of Large Scale Learning, 2007, NIPS.

[10] Nicolas Le Roux, et al., Topmoumoute Online Natural Gradient Algorithm, 2007, NIPS.

[11] Elad Hazan, et al., Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[12] Simon Günter, et al., A Stochastic Quasi-Newton Method for Online Convex Optimization, 2007, AISTATS.

[13] Chih-Jen Lin, et al., LIBLINEAR: A Library for Large Linear Classification, 2008, J. Mach. Learn. Res.

[14] John Langford, et al., Sparse Online Learning via Truncated Gradient, 2008, NIPS.

[15] Ambuj Tewari, et al., Stochastic methods for l1 regularized loss minimization, 2009, ICML '09.

[16] Alexander Shapiro, et al., Stochastic Approximation approach to Stochastic Programming, 2013.

[17] Patrick Gallinari, et al., SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent, 2009, J. Mach. Learn. Res.

[18] Yoram Singer, et al., Pegasos: primal estimated sub-gradient solver for SVM, 2011, Math. Program.