Putting Bayes to sleep

We consider sequential prediction algorithms that are given the predictions from a set of models as inputs. If the nature of the data changes over time, in that different models predict well on different segments of the data, then adaptivity is typically achieved by mixing into the weights in each round a bit of the initial prior (akin to a weak restart). However, what if the favored models in each segment come from a small subset, i.e., the data is likely to be predicted well by models that predicted well before? Curiously, fitting such "sparse composite models" is achieved by mixing in a bit of all the past posteriors. This self-referential updating method is rather peculiar, but it is efficient and gives superior performance on many natural data sets. It is also important because it introduces a long-term memory: any model that has done well in the past can be recovered quickly. While Bayesian interpretations are known for mixing in a bit of the initial prior, no Bayesian interpretation is known for mixing in past posteriors. We build on the "specialist" framework from the online learning literature to give the Mixing Past Posteriors update a proper Bayesian foundation. We apply our method to a well-studied multitask learning problem and obtain a new, intriguing, and efficient update that achieves a significantly better bound.
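
To make the contrast between the two mixing schemes concrete, the following minimal Python sketch shows one round of an exponential-weights loss update followed by a mixing step. The function name mpp_update, the uniform mixing over all past posteriors, and the parameters eta and alpha are illustrative assumptions introduced here, not the exact scheme or notation analyzed in the paper; mixing in only the initial prior v_0 would correspond to the "weak restart" style of adaptivity.

    import numpy as np

    def mpp_update(past_posteriors, losses, eta=1.0, alpha=0.01):
        """One round of an exponential-weights style update followed by a
        mixing step in the spirit of Mixing Past Posteriors (sketch only).

        past_posteriors : list of weight vectors [v_0, ..., v_{t-1}],
                          where v_0 is the initial prior.
        losses          : per-model losses suffered this round.
        Returns the posterior to use for prediction in the next round.
        """
        w = past_posteriors[-1]
        # Bayesian / exponential-weights "loss update"
        v = w * np.exp(-eta * np.asarray(losses))
        v = v / v.sum()
        # Mixing step: mixing in only past_posteriors[0] (the initial prior)
        # gives the usual weak-restart adaptivity; mixing in a bit of *all*
        # past posteriors (here uniformly, for simplicity) provides the
        # long-term memory that lets previously good models be recovered quickly.
        past_mix = np.mean(past_posteriors, axis=0)
        return (1 - alpha) * v + alpha * past_mix

    # Illustrative usage: three models, a few rounds of made-up losses.
    posteriors = [np.ones(3) / 3]  # start from the uniform prior
    for losses in ([0.1, 0.9, 0.5], [0.2, 0.8, 0.4]):
        posteriors.append(mpp_update(posteriors, losses))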
