Deriving and Analyzing Learning Algorithms

Project Summary

There is a large variety of learning problems across all disciplines waiting for the right algorithms. Many of these are on-line problems, where the learning algorithm continually makes predictions and updates its hypothesis after receiving each “correct” outcome. Efficient algorithms may be unable to keep the entire history, and thus must compress their experience into hypotheses. This leads to a tension when the algorithm predicts incorrectly: it must correct its hypothesis in case the same instance is seen again, yet it must move cautiously to preserve its previously acquired knowledge. One way to quantify this tradeoff is to put a distance measure on the space of possible hypotheses and to trade off the improvement of the prediction on the last example against the distance moved. For the simple linear regression setting, Kivinen and Warmuth showed how two different distances lead to two radically different families of algorithms. One of these families makes additive updates to its hypothesis and includes the standard gradient descent methods. The other family makes multiplicative updates and has radically different performance. Amortized analysis techniques are used to prove relative loss bounds (similar to competitive ratios) on the algorithms, and these relative loss bounds provide a yardstick to measure the effectiveness of each learning family. Although neither family is better all of the time, the new multiplicative family performs exponentially better in many natural settings.

The proposed work will extend the framework of Kivinen and Warmuth in a variety of ways. The existing setup requires a fixed learning rate that must be carefully tuned, and an important proposed direction is to analyze annealed and self-tuned learning rates. The boosting setting is different from, but closely related to, the on-line learning setting, and the second proposed direction is to modify the framework to cover boosting problems. Most current bounds compare the loss of the algorithm against the best fixed predictor, and the third main direction of the proposal is to extend the framework so that algorithms can be compared against shifting predictors that can change over time.
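
The contrast between the additive and multiplicative families can be illustrated with a minimal sketch for on-line linear regression under square loss. It is an illustration under stated assumptions, not the proposal's own code: the function names, the learning rate `eta`, and the toy example are hypothetical, the factor 2 comes from differentiating the square loss, and the multiplicative update is assumed to keep the weights positive and normalized to sum to one.

```python
import math


def predict(w, x):
    """Linear prediction: dot product of weights and instance."""
    return sum(wi * xi for wi, xi in zip(w, x))


def gd_update(w, x, y, eta):
    """Additive (gradient descent) update for square loss.

    Each weight moves by a step proportional to the loss gradient.
    """
    err = predict(w, x) - y
    return [wi - 2.0 * eta * err * xi for wi, xi in zip(w, x)]


def eg_update(w, x, y, eta):
    """Multiplicative (exponentiated gradient) update for square loss.

    Each weight is scaled by an exponential factor, then the vector is
    renormalized (assumption: weights live on the probability simplex).
    """
    err = predict(w, x) - y
    scaled = [wi * math.exp(-2.0 * eta * err * xi) for wi, xi in zip(w, x)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical toy example: one instance, one outcome, one update each.
x, y, eta = [1.0, -1.0, 1.0, -1.0], 1.0, 0.1
w_gd = gd_update([0.0] * 4, x, y, eta)   # additive start: zero vector
w_eg = eg_update([0.25] * 4, x, y, eta)  # multiplicative start: uniform
print("GD weights:", w_gd, "prediction:", predict(w_gd, x))
print("EG weights:", w_eg, "prediction:", predict(w_eg, x))
```

Both updates move the prediction on the last example toward the outcome; they differ in how far, and in which direction, the hypothesis moves, which reflects the choice of distance measure described above.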
