Online Learning in Adversarial Lipschitz Environments

We consider the problem of online learning in an adversarial environment when the reward functions chosen by the adversary are assumed to be Lipschitz. This setting extends previous works on linear and convex online learning. We provide a class of algorithms with cumulative regret upper bounded by O(√dt ln(λ)) where d is the dimension of the search space, T the time horizon, and λ the Lipschitz constant. Efficient numerical implementations using particle methods are discussed. Applications include online supervised learning problems for both full and partial (bandit) information settings, for a large class of non-linear regressors/classifiers, such as neural networks.

[1]  Adam Tauman Kalai,et al.  Online convex optimization in the bandit setting: gradient descent without a gradient , 2004, SODA '05.

[2]  Shai Shalev-Shwartz,et al.  Online learning: theory, algorithms and applications (למידה מקוונת.) , 2007 .

[3]  Elad Hazan,et al.  Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization , 2008, COLT.

[4]  Peter L. Bartlett,et al.  Adaptive Online Gradient Descent , 2007, NIPS.

[5]  Martin Zinkevich,et al.  Online Convex Programming and Generalized Infinitesimal Gradient Ascent , 2003, ICML.

[6]  Thomas P. Hayes,et al.  The Price of Bandit Information for Online Optimization , 2007, NIPS.

[7]  Peter Auer,et al.  The Nonstochastic Multiarmed Bandit Problem , 2002, SIAM J. Comput..

[8]  P. Bartlett,et al.  Optimal strategies and minimax lower bounds for online convex games [Technical Report No. UCB/EECS-2008-19] , 2008 .

[9]  Nando de Freitas,et al.  An Introduction to MCMC for Machine Learning , 2004, Machine Learning.

[10]  Elad Hazan,et al.  Logarithmic regret algorithms for online convex optimization , 2006, Machine Learning.

[11]  Peter Green,et al.  Markov chain Monte Carlo in Practice , 1996 .

[12]  中澤 真,et al.  Devroye, L., Gyorfi, L. and Lugosi, G. : A Probabilistic Theory of Pattern Recognition, Springer (1996). , 1997 .

[13]  Manfred K. Warmuth,et al.  The weighted majority algorithm , 1989, 30th Annual Symposium on Foundations of Computer Science.

[14]  David Haussler,et al.  How to use expert advice , 1993, STOC.

[15]  Peter Auer,et al.  Using Confidence Bounds for Exploitation-Exploration Trade-offs , 2003, J. Mach. Learn. Res..

[16]  Ambuj Tewari,et al.  Efficient bandit algorithms for online multiclass prediction , 2008, ICML '08.

[17]  Csaba Szepesvári,et al.  Online Optimization in X-Armed Bandits , 2008, NIPS.

[18]  Jan Poland,et al.  Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments , 2008, Theor. Comput. Sci..

[19]  P. Moral Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications , 2004 .

[20]  R. Douc,et al.  Minimum variance importance sampling via Population Monte Carlo , 2007 .

[21]  Gábor Lugosi,et al.  Prediction, learning, and games , 2006 .

[22]  Elizabeth L. Wilmer,et al.  Markov Chains and Mixing Times , 2008 .

[23]  Claudio Gentile,et al.  Adaptive and Self-Confident On-Line Learning Algorithms , 2000, J. Comput. Syst. Sci..

[24]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[25]  Eli Upfal,et al.  Multi-Armed Bandits in Metric Spaces ∗ , 2008 .

[26]  Nicolò Cesa-Bianchi,et al.  Combinatorial Bandits , 2012, COLT.

[27]  Nicolò Cesa-Bianchi,et al.  Gambling in a rigged casino: The adversarial multi-armed bandit problem , 1995, Proceedings of IEEE 36th Annual Foundations of Computer Science.

[28]  László Györfi,et al.  A Probabilistic Theory of Pattern Recognition , 1996, Stochastic Modelling and Applied Probability.

[29]  Gilles Stoltz Incomplete information and internal regret in prediction of individual sequences , 2005 .

[30]  Thomas P. Hayes,et al.  Stochastic Linear Optimization under Bandit Feedback , 2008, COLT.