Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic

When the goal is to achieve the best correct classification rate, cross entropy and mean squared error are typical cost functions used to optimize classifier performance. However, for many real-world classification problems, the receiver operating characteristic (ROC) curve is a more meaningful performance measure. We demonstrate that minimizing cross entropy or mean squared error does not necessarily maximize the area under the ROC curve (AUC). We then consider alternative objective functions for training a classifier to maximize the AUC directly. We propose an objective function that is an approximation to the Wilcoxon-Mann-Whitney statistic, which is equivalent to the AUC. The proposed objective function is differentiable, so gradient-based methods can be used to train the classifier. We apply the new objective function to real-world customer behavior prediction problems for a wireless service provider and a cable service provider, and achieve reliable improvements in the ROC curve.
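
To make the construction concrete, the sketch below contrasts the exact Wilcoxon-Mann-Whitney statistic (which equals the AUC) with a generic differentiable surrogate in which the pairwise step function is replaced by a sigmoid, so the objective can be minimized with gradient-based methods. This is an illustrative relaxation under stated assumptions, not necessarily the paper's own approximation; the function names and the sharpness parameter `beta` are hypothetical.

```python
import numpy as np

def wmw_statistic(pos_scores, neg_scores):
    """Exact Wilcoxon-Mann-Whitney statistic, equal to the AUC:
    the fraction of (positive, negative) score pairs that are
    ordered correctly, with ties counted as half."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    diff = pos[:, None] - neg[None, :]          # all pairwise score differences
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def soft_wmw_loss(pos_scores, neg_scores, beta=10.0):
    """Differentiable surrogate for 1 - WMW: the indicator over each
    pairwise difference is replaced by a sigmoid, so the loss is smooth
    in the classifier scores and can be reduced by gradient descent.
    `beta` (an assumed hyperparameter) controls how sharply the sigmoid
    approximates the step function."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    diff = pos[:, None] - neg[None, :]
    return np.mean(1.0 / (1.0 + np.exp(beta * diff)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.normal(1.0, 1.0, size=200)   # scores assigned to positive examples
    neg = rng.normal(0.0, 1.0, size=200)   # scores assigned to negative examples
    print("WMW statistic (AUC):   ", wmw_statistic(pos, neg))
    print("sigmoid surrogate loss:", soft_wmw_loss(pos, neg))
```

In practice the pairwise differences would come from the classifier's outputs on positive and negative training examples, and the gradient of the surrogate with respect to the model parameters would be obtained by backpropagating through those scores.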
