Probabilistic Classification Vector Machines

In this paper, a sparse learning algorithm, probabilistic classification vector machines (PCVMs), is proposed. We analyze relevance vector machines (RVMs) for classification problems and observe that adopting the same prior for different classes may lead to unstable solutions. To tackle this problem, a signed and truncated Gaussian prior is adopted over every weight in PCVMs, where the sign of the prior is determined by the class label, i.e., +1 or -1. The truncated Gaussian prior not only restricts the sign of the weights but also leads to a sparse estimate of the weight vector, and thus controls the complexity of the model. In PCVMs, the kernel parameters can be optimized simultaneously within the training algorithm. The performance of PCVMs is extensively evaluated on four synthetic data sets and 13 benchmark data sets using three performance metrics: error rate (ERR), area under the receiver operating characteristic curve (AUC), and root mean squared error (RMSE). We compare PCVMs with soft-margin support vector machines (SVMSoft), hard-margin support vector machines (SVMHard), SVMs with kernel parameters optimized by PCVMs (SVMPCVM), relevance vector machines (RVMs), and several other baseline classifiers. Using five replications of the twofold cross-validation F test, i.e., the 5 × 2 cross-validation F test, on single data sets, and the Friedman test with the corresponding post hoc tests to compare these algorithms over multiple data sets, we find that PCVMs outperform the other algorithms, including SVMSoft, SVMHard, RVM, and SVMPCVM, on most of the data sets under all three metrics, especially under AUC. Our results also show that SVMPCVM performs slightly better than SVMSoft, implying that the parameter optimization algorithm in PCVMs is superior to cross-validation in terms of both performance and computational complexity. Finally, we discuss the advantages of the PCVM formulation using maximum a posteriori (MAP) analysis and margin analysis, which together explain the empirical success of PCVMs.
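To make the signed, truncated prior concrete, the following is a minimal formalization consistent with the description above; the symbols w_i (the weight attached to the i-th basis function), alpha_i (its inverse prior variance), and y_i (the class label of the i-th training point) are our notation for illustration, not necessarily the paper's. A zero-mean Gaussian is truncated to the half-line whose sign matches the label:

p(w_i \mid \alpha_i) =
\begin{cases}
2\,\mathcal{N}(w_i \mid 0, \alpha_i^{-1}), & \text{if } y_i w_i \ge 0, \\
0, & \text{otherwise},
\end{cases}

where the factor 2 renormalizes the truncated density. Under such a prior, weights whose precision alpha_i is driven to infinity during training collapse to zero, which yields the sparse weight vector mentioned above.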
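For readers who wish to reproduce the statistical comparison, the following is a minimal Python sketch of the combined 5 × 2 cross-validation F test used above (Alpaydin, 1999). The function name five_by_two_cv_f_test and the convention that diffs stores per-replication, per-fold differences in error rate are our assumptions for illustration:

import numpy as np
from scipy import stats

def five_by_two_cv_f_test(diffs):
    """Combined 5 x 2 cv F test (Alpaydin, 1999).

    diffs: array-like of shape (5, 2) holding, for each of the five
    replications, the difference in error rate between the two
    classifiers on each of the two folds.
    Returns the F statistic and its p-value under the F(10, 5)
    distribution.
    """
    diffs = np.asarray(diffs, dtype=float)
    means = diffs.mean(axis=1, keepdims=True)
    # Per-replication variance estimate from the two fold differences.
    s2 = ((diffs - means) ** 2).sum(axis=1)
    # F statistic: sum of squared differences over twice the summed variances.
    f = (diffs ** 2).sum() / (2.0 * s2.sum())
    p = stats.f.sf(f, 10, 5)  # upper-tail p-value
    return f, p

For example, f, p = five_by_two_cv_f_test(d) with d of shape (5, 2) rejects the null hypothesis of equal error rates when p falls below the chosen significance level.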
