Prediction Algorithms: Complexity, Concentration and Convexity

Abstract In this paper, we review two families of algorithms used to estimate large-scale statistical models for prediction problems: kernel methods and boosting algorithms. We focus on the computational and statistical properties of prediction algorithms of this kind. Convexity plays an important role for these algorithms, since they exploit the computational advantages of convex optimization procedures. Beyond its computational advantages, however, the use of convexity in these methods also confers attractive statistical properties. We present some recent results that show the advantages of convexity for estimation rates, the rates at which prediction accuracies approach their optimal values. In addition, we present results that quantify the cost of using a convex loss function in place of the real loss function of interest.
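As a concrete illustration of the last point, the following is a minimal sketch of the standard surrogate-loss setup for binary classification (a common formalization assumed here, not quoted from the paper). For labels $y \in \{-1,+1\}$ and a real-valued predictor $f$, the loss of interest is the 0-1 loss, which is non-convex in the margin $y f(x)$, while boosting and kernel methods minimize a convex surrogate $\phi$ of that margin:
\[
  \ell_{0\text{-}1}\bigl(y, f(x)\bigr) \;=\; \mathbf{1}\{\, y f(x) \le 0 \,\}, \qquad
  \phi_{\exp}\bigl(y f(x)\bigr) \;=\; e^{-y f(x)}, \qquad
  \phi_{\mathrm{hinge}}\bigl(y f(x)\bigr) \;=\; \max\{0,\; 1 - y f(x)\}.
\]
The exponential loss underlies AdaBoost and the hinge loss underlies the support vector machine; the "cost" of convexity referred to above compares the excess 0-1 risk incurred by minimizing the convex $\phi$-risk with the best achievable 0-1 risk.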
