Online learning: theory, algorithms and applications (למידה מקוונת)

Online learning is the process of answering a sequence of questions given knowledge of the correct answers to previous questions and possibly additional available information. Answering questions in an intelligent fashion and being able to make rational decisions as a result is a basic feature of everyday life. Will it rain today (so should I take an umbrella)? Should I fight the wild animal that is after me, or should I run away? Should I open an attachment in an email message or is it a virus? The study of online learning algorithms is thus an important domain in machine learning, and one that has interesting theoretical properties and practical applications. This dissertation describes a novel framework for the design and analysis of online learning algorithms. We show that various online learning algorithms can all be derived as special cases of our algorithmic framework. This unified view explains the properties of existing algorithms and also enables us to derive several interesting new algorithms.

Online learning is performed in a sequence of consecutive rounds, where at each round the learner is given a question and is required to provide an answer to this question. After the learner predicts an answer, the correct answer is revealed and the learner suffers a loss if there is a discrepancy between its answer and the correct one. The algorithmic framework for online learning we propose in this dissertation stems from a connection that we make between the notions of regret in online learning and weak duality in convex optimization. Regret bounds are the common thread in the analysis of online learning algorithms. A regret bound measures the performance of an online algorithm relative to the performance of a competing prediction mechanism, called a competing hypothesis. The competing hypothesis can be chosen in hindsight from a class of hypotheses, after observing the entire sequence of question-answer pairs.
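
The round-by-round protocol and the notion of regret described above can be illustrated with a small sketch. Everything in it is an assumption made for illustration rather than the framework proposed in the dissertation: the learner is a plain follow-the-leader rule, the loss is the 0-1 loss, the hypothesis class is a handful of threshold functions, and the names `make_threshold` and `run_online` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A hypothesis maps a question (here a real number) to an answer in {0, 1}.
Hypothesis = Callable[[float], int]


def make_threshold(t: float) -> Hypothesis:
    """Hypothetical hypothesis: answer 1 when the question is at least t."""
    return lambda x: 1 if x >= t else 0


def zero_one_loss(prediction: int, answer: int) -> int:
    """Loss of 1 when the prediction and the correct answer disagree."""
    return int(prediction != answer)


def run_online(
    rounds: Sequence[Tuple[float, int]],
    hypotheses: Sequence[Hypothesis],
) -> Tuple[int, int, int]:
    """Play the online protocol; return (learner loss, best fixed loss, regret)."""
    cumulative: List[int] = [0] * len(hypotheses)
    learner_loss = 0
    for question, answer in rounds:
        # Illustrative learner (follow the leader): use the hypothesis with
        # the smallest loss on past rounds, breaking ties by lowest index.
        leader = min(range(len(hypotheses)), key=lambda i: cumulative[i])
        prediction = hypotheses[leader](question)
        # The correct answer is revealed only after the prediction is made.
        learner_loss += zero_one_loss(prediction, answer)
        for i, h in enumerate(hypotheses):
            cumulative[i] += zero_one_loss(h(question), answer)
    # The competing hypothesis is chosen in hindsight, after all rounds.
    best_fixed_loss = min(cumulative)
    return learner_loss, best_fixed_loss, learner_loss - best_fixed_loss


if __name__ == "__main__":
    hypotheses = [make_threshold(t) for t in (0.25, 0.5, 0.75)]
    rounds = [(0.1, 0), (0.6, 1), (0.3, 0), (0.9, 1), (0.4, 0)]
    print(run_online(rounds, hypotheses))
```

On this toy sequence the learner makes one mistake while the best threshold in hindsight makes none, so the regret is 1. A regret bound would limit this gap for every sequence and every hypothesis in the class, not only for this example.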
Over the years, competitive analysis techniques have been refined and extended to numerous prediction problems by employing complex and varied notions of progress toward a good competing hypothesis. We propose a new perspective on regret bounds which is based on the notion of duality in convex optimization. Regret bounds are universal in the sense that they hold for any possible fixed hypothesis in a given hypothesis class. We therefore cast the universal bound as a lower bound
