RI: Small: Collaborative Research: Probabilistic Models using Generalized Exponential Families

Minimizing convex loss functions is a central paradigm of machine learning, and a plethora of models based on it have been proposed over the past several decades. A fundamental problem with all such models is that they are not robust to outliers. In contrast, we explore the consequences of building probabilistic models with the generalized exponential families of distributions proposed in statistical physics. These families lead to loss functions that are quasi-convex and flatten out for misclassified points far from the decision boundary; consequently, the resulting models are robust to outliers. In this proposal we outline a research agenda to make models based on these new distribution families practical, and to study their generality and applicability to machine learning. Building upon our recent work on t-logistic regression, a generalization of logistic regression, we will show how conditional models based on this new family of distributions can be developed. The key challenge when working with these generalized families, as with the exponential family, is to compute the log-partition function and perform inference efficiently. We will address this challenge in two settings. For problems such as multiclass classification, where the number of classes is fairly small, we will develop exact iterative algorithms; a sketch of such a computation for the binary case appears below. For problems such as sequence classification, where the number of classes is exponentially large, we will develop approximate inference techniques by extending variational methods. We will also explore models that drop the normalization constraint, which sidesteps the computation of the log-partition function entirely. We have carried out large-scale experiments with models that employ convex loss functions and made all of our implementations available; building on this work, we will conduct systematic benchmark comparisons of the new algorithms against existing ones and release our code publicly.
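
To make the robustness mechanism concrete, here is a minimal sketch of a binary t-logistic model, assuming the Tsallis t-exponential exp_t(x) = [1 + (1 - t)x]_+^{1/(1-t)} with t in (1, 2), labels in {-1, +1}, and a negative log-likelihood objective. The function names (exp_t, log_partition, t_logistic_loss) are illustrative, not taken from the proposal's codebase. Because exp_t is monotone, the per-example log-partition value can be found exactly by bisection, which is the kind of exact iterative computation the proposal envisions when the number of classes is small.

```python
import numpy as np

def exp_t(x, t):
    """Tsallis t-exponential: exp_t(x) = [1 + (1-t)x]_+ ** (1/(1-t)), t != 1."""
    return np.maximum(1.0 + (1.0 - t) * x, 0.0) ** (1.0 / (1.0 - t))

def log_partition(u, t, tol=1e-10):
    """Find G such that exp_t(u - G) + exp_t(-u - G) = 1, by bisection.

    The left-hand side is decreasing in G, so a root is bracketed once
    we find lo, hi with f(lo) >= 1 >= f(hi).  At lo = |u| one term equals
    exp_t(0) = 1, hence f(lo) >= 1.
    """
    f = lambda G: exp_t(u - G, t) + exp_t(-u - G, t)
    lo, step = abs(u), 1.0
    hi = lo + step
    while f(hi) > 1.0:          # expand the bracket until the root is inside
        step *= 2.0
        hi = lo + step
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def t_logistic_loss(w, X, y, t=1.5):
    """Mean negative log-likelihood of the binary t-exponential model."""
    u = X @ w                   # margins; labels y are in {-1, +1}
    losses = []
    for ui, yi in zip(u, y):
        G = log_partition(ui, t)
        losses.append(-np.log(exp_t(yi * ui - G, t)))  # -log p(y|x)
    return np.mean(losses)

# Sanity check (hypothetical toy data): with w = 0 both classes get
# probability 1/2, so the loss equals log(2) for any t.
# w = np.zeros(2); X = np.array([[1., 2.], [-3., 0.5]]); y = np.array([1, -1])
# print(t_logistic_loss(w, X, y))
```

As t approaches 1 this recovers ordinary logistic regression, while for t in (1, 2) the polynomial tails of exp_t mean that a badly misclassified point contributes only logarithmically to the loss rather than linearly, which is the source of the outlier robustness described above.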
