Forward stagewise naïve Bayes

The naïve Bayes approach is a simple but often satisfactory method for supervised classification. In this paper, we focus on the naïve Bayes model and propose the application of regularization techniques to learn a naïve Bayes classifier. The main contribution of the paper is a stagewise version of the selective naïve Bayes, which can be considered a regularized version of the naïve Bayes model; we call it forward stagewise naïve Bayes. For comparison's sake, we also introduce an explicitly regularized formulation of the naïve Bayes model, where conditional independence (absence of arcs) is promoted via an L1/L2 group penalty on the parameters that define the conditional probability distributions. Although this idea has already been published in the literature, it has only been applied to continuous predictors. We extend the formulation to discrete predictors and propose a modification that yields an adaptive penalization. We show that, whereas the L1/L2 group penalty formulation only discards irrelevant predictors, forward stagewise naïve Bayes can discard both irrelevant and redundant predictors, which are known to be harmful to the naïve Bayes classifier. Both approaches, however, usually improve on the accuracy of the classical naïve Bayes model.
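As background for the explicitly regularized formulation, the objective can be sketched as a penalized log-likelihood with a group penalty. The exact parameterization used in the paper may differ; the sketch below is one plausible reading, in which the penalty acts on the deviation of each predictor's class-conditional parameters from a shared, class-independent parameterization:

```latex
% Sketch of a group-penalized naive Bayes objective (illustrative; the
% paper's exact parameterization may differ).
\min_{\theta}\;
  -\sum_{i=1}^{n} \log p\!\left(c_i, \mathbf{x}_i \mid \theta\right)
  \;+\; \lambda \sum_{j=1}^{p} \lVert \Delta_j \rVert_2
```

Here \(\Delta_j\) stacks the differences between the class-conditional parameters of predictor \(X_j\) and their class-independent counterparts. Because the L1/L2 group penalty shrinks whole groups exactly to zero, \(\Delta_j = 0\) makes \(p(x_j \mid c)\) identical across classes, which is equivalent to removing the arc from the class to \(X_j\), i.e., discarding the predictor. The adaptive variant mentioned in the abstract would replace \(\lambda\) with group-specific weights, e.g. \(\lambda_j \propto 1 / \lVert \hat{\Delta}_j \rVert_2\) computed from an unpenalized fit, in the spirit of the adaptive lasso.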
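The selective naïve Bayes that the stagewise method builds on can be illustrated with a simple wrapper-style forward selection: at each step, add the predictor that most improves a validation criterion, and stop when no candidate helps. This is a minimal sketch only; the selection criterion (cross-validated accuracy), the Gaussian conditional model, and the helper names are assumptions for illustration, not the authors' exact stagewise update, which additionally moves parameters in small increments rather than adding predictors outright.

```python
# Minimal sketch of greedy forward predictor selection for a selective
# naive Bayes classifier. Criterion and model choices are illustrative
# assumptions, not the paper's exact algorithm.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def forward_selective_nb(X, y, max_predictors=None):
    """Greedily add the predictor that most improves CV accuracy."""
    n_features = X.shape[1]
    max_predictors = max_predictors or n_features
    selected, best_score = [], -np.inf
    while len(selected) < max_predictors:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        # Score each candidate by 5-fold CV accuracy of NB on selected + {j}.
        scores = {
            j: cross_val_score(GaussianNB(), X[:, selected + [j]], y, cv=5).mean()
            for j in candidates
        }
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            # No candidate improves the criterion: stop early, leaving
            # irrelevant (and, unlike the group penalty, redundant)
            # predictors out of the model.
            break
        selected.append(j_best)
        best_score = scores[j_best]
    return selected, best_score
```

A typical call would be `selected, score = forward_selective_nb(X_train, y_train)`. The early-stopping check is what lets this family of methods drop redundant predictors: a feature that duplicates information already in the model does not improve the wrapper criterion, so it is never added.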
