Max-Margin Markov Networks

In typical classification tasks, we seek a function that assigns a label to a single object. Kernel-based approaches such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from their ability to exploit high-dimensional feature spaces and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or otherwise structured data, where multiple labels must be assigned jointly. Existing kernel-based methods ignore this structure, assigning labels independently to each object and thereby losing much useful information. Conversely, probabilistic graphical models such as Markov networks can represent correlations between labels by exploiting problem structure, but they cannot handle high-dimensional feature spaces and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: maximum-margin Markov (M3) networks incorporate both kernels, which deal efficiently with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic programming formulation, and we provide a new theoretical bound on generalization in structured domains. Experiments on handwritten character recognition and collective hypertext classification demonstrate significant gains over previous approaches.
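To make the quadratic program concrete, here is a minimal sketch of the structured max-margin primal in the form commonly associated with M3 networks; the notation (joint feature map f, Hamming loss Δt) is shorthand introduced here, not a quotation from the paper. Given training pairs (x_i, y_i) over multiple label variables, the weights w solve

\[
\min_{\mathbf{w},\,\boldsymbol{\xi}} \ \tfrac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_i \xi_i
\qquad \text{s.t.} \quad
\mathbf{w}^{\top}\bigl[\mathbf{f}(\mathbf{x}_i, \mathbf{y}_i) - \mathbf{f}(\mathbf{x}_i, \mathbf{y})\bigr] \;\ge\; \Delta t_i(\mathbf{y}) - \xi_i
\quad \forall i,\ \forall \mathbf{y},
\]

where Δt_i(y) counts the label variables on which the assignment y disagrees with the true labeling y_i, so assignments that make more mistakes must be separated by a larger margin. Written this way, the program has exponentially many constraints, one per joint assignment y; the compactness referred to above comes from reformulating the dual over per-clique marginal variables of the Markov network, which reduces the program to polynomial size.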
