On the Optimality of the Simple Bayesian Classifier under Zero-One Loss

The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive. This article shows that, although the Bayesian classifier's probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zero-one loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadratic-loss optimality of the Bayesian classifier is in fact a second-order infinitesimal fraction of the region of zero-one optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article's results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
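
As an informal illustration of the zero-one-loss claim for conjunctive concepts, the sketch below builds a naive Bayes classifier from the exact probabilities of the concept x1 AND x2 AND x3 under a uniform distribution over the Boolean instance space. The attribute count `D` and the helper `naive_bayes_predict` are hypothetical choices made for this demonstration, not constructs taken from the article; the point is only that, even though the attributes are strongly dependent given the class, the classifier misclassifies no example.

```python
# Minimal sketch (not the article's code): naive Bayes with exact
# class-conditional probabilities for the conjunction y = x1 AND x2 AND x3
# over a uniform distribution of Boolean examples. The independence
# assumption is violated, yet the zero-one loss over the instance space is 0.

from itertools import product

D = 3  # number of Boolean attributes (hypothetical choice for the demo)

# Enumerate the full instance space and label each example with the conjunction.
examples = [(x, int(all(x))) for x in product([0, 1], repeat=D)]

# Exact probabilities under a uniform distribution over the instance space.
n = len(examples)
prior = {c: sum(1 for _, y in examples if y == c) / n for c in (0, 1)}
cond = {}  # cond[(i, v, c)] = P(x_i = v | class = c)
for c in (0, 1):
    n_c = sum(1 for _, y in examples if y == c)
    for i in range(D):
        for v in (0, 1):
            n_ivc = sum(1 for x, y in examples if y == c and x[i] == v)
            cond[(i, v, c)] = n_ivc / n_c

def naive_bayes_predict(x):
    """Pick the class maximizing P(c) * prod_i P(x_i | c)."""
    scores = {}
    for c in (0, 1):
        s = prior[c]
        for i, v in enumerate(x):
            s *= cond[(i, v, c)]
        scores[c] = s
    return max(scores, key=scores.get)

errors = sum(naive_bayes_predict(x) != y for x, y in examples)
print(f"zero-one errors over the whole instance space: {errors}")  # prints 0
```

Under quadratic loss the story differs: for the single positive example the classifier's estimate of P(class = 1 | x) is not the true value 1, so its probability estimates are biased even though the class it picks, and hence its zero-one loss, is exactly right.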
