Bayesian classifiers based on kernel density estimation: Flexible classifiers

When learning Bayesian network based classifiers, continuous variables are usually handled either by discretizing them or by assuming that they follow a Gaussian distribution. This work introduces the kernel based Bayesian network paradigm for supervised classification: a Bayesian network which estimates the true density of the continuous variables using kernels. In addition, tree-augmented naive Bayes, the k-dependence Bayesian classifier and the complete graph classifier are adapted to the novel kernel based paradigm. Moreover, the strong consistency properties of the presented classifiers are proved, and a kernel based estimator of the mutual information is presented. The classifiers presented in this work can be seen as the natural extension of the flexible naive Bayes classifier proposed by John and Langley [G.H. John, P. Langley, Estimating continuous distributions in Bayesian classifiers, in: Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, 1995, pp. 338-345], breaking with its strong independence assumption. Among the flexible classifiers, flexible tree-augmented naive Bayes seems to behave best for supervised classification. Furthermore, the flexible classifiers presented obtain competitive errors compared with state-of-the-art classifiers.
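
The abstract names two concrete ingredients: class-conditional kernel density estimates replacing the single Gaussian of John and Langley's flexible naive Bayes [42], and a kernel based plug-in estimator of mutual information in the spirit of Moon et al. [7]. Since only the abstract is reproduced here, the Python sketch below is a schematic illustration under stated assumptions, not the paper's exact estimators: the class name FlexibleNaiveBayes, the use of scipy.stats.gaussian_kde with its default Scott's-rule bandwidth, and the 1e-300 underflow guard are all illustrative choices.

    import numpy as np
    from scipy.stats import gaussian_kde

    class FlexibleNaiveBayes:
        """Naive Bayes with one univariate Gaussian KDE per (class, feature).

        A minimal sketch of the flexible naive Bayes idea; the bandwidth
        rule and the smoothing constant are assumptions, not the paper's.
        Expects X as a (n, d) numpy array and y as a (n,) numpy array.
        """

        def fit(self, X, y):
            self.classes_ = np.unique(y)
            self.priors_ = {c: np.mean(y == c) for c in self.classes_}
            # Conditional independence given the class: one KDE per feature.
            self.kdes_ = {
                c: [gaussian_kde(X[y == c, j]) for j in range(X.shape[1])]
                for c in self.classes_
            }
            return self

        def predict(self, X):
            # Log prior plus the sum of per-feature log densities:
            # the naive Bayes factorization with kernel densities.
            scores = np.column_stack([
                np.log(self.priors_[c])
                + sum(np.log(self.kdes_[c][j](X[:, j]) + 1e-300)
                      for j in range(X.shape[1]))
                for c in self.classes_
            ])
            return self.classes_[np.argmax(scores, axis=1)]

    def kernel_mutual_information(x, y):
        """Plug-in estimate of I(X;Y) = E[log f(x,y) / (f(x) f(y))],
        replacing each density by a Gaussian KDE and averaging over the
        sample itself (cf. Moon et al. [7])."""
        pts = np.vstack([x, y])  # joint sample, shape (2, n)
        joint, fx, fy = gaussian_kde(pts), gaussian_kde(x), gaussian_kde(y)
        return float(np.mean(np.log(joint(pts) / (fx(x) * fy(y)))))

An estimator of this kind can score candidate arcs when the structure is extended beyond naive Bayes, for instance to choose the maximum weight spanning tree of a flexible tree-augmented naive Bayes; that usage is inferred from the abstract rather than shown above.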

[2]  Nir Friedman,et al.  Bayesian Network Classifiers , 1997, Machine Learning.

[3]  Pedro M. Domingos  A Unified Bias-Variance Decomposition and its Applications , 2000, ICML.

[4]  Mayer Aladjem,et al.  Projection Pursuit Fitting Gaussian Mixture Models , 2002, SSPR/SPR.

[5]  Enrique F. Castillo,et al.  Expert Systems and Probabilistic Network Models , 1996, Monographs in Computer Science.

[6]  Remco R. Bouckaert Naive Bayes Classifiers That Perform Well with Continuous Variables , 2004, Australian Conference on Artificial Intelligence.

[7]  Y.-I. Moon,et al.  Estimation of mutual information using kernel density estimators , 1995, Physical Review E.

[8]  Jeff A. Bilmes,et al.  A gentle tutorial of the em algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models , 1998 .

[9]  Steffen L. Lauritzen  Graphical Models , 1996 , Oxford University Press.

[10]  Serafín Moral,et al.  Mixtures of Truncated Exponentials in Hybrid Bayesian Networks , 2001, ECSQARU.

[11]  Ron Kohavi,et al.  Wrappers for Feature Subset Selection , 1997, Artif. Intell..

[12]  B. W. Silverman  Density Estimation for Statistics and Data Analysis , 1986 , Chapman and Hall.

[13]  Neil D. Lawrence,et al.  A Comparison of State-of-the-Art Classification Techniques with Application to Cytogenetics , 2001, Neural Computing & Applications.

[14]  P. van der Putten,et al.  A Bias-Variance Analysis of a Real World Learning Problem: The CoIL Challenge 2000 , 2004, Machine Learning.

[15]  M. H. DeGroot  Optimal Statistical Decisions , 1970 .

[16]  E. B. Andersen,et al.  Information Science and Statistics , 1986 .

[17]  Keinosuke Fukunaga,et al.  Statistical Pattern Recognition , 1993, Handbook of Pattern Recognition and Computer Vision.

[18]  Pedro Larrañaga,et al.  Information Theory and Classification Error in Probabilistic Classifiers , 2006, Discovery Science.

[19]  David B. Allison,et al.  A mixture model approach for the analysis of microarray gene expression data , 2002 .

[20]  Michael J. Pazzani,et al.  Searching for Dependencies in Bayesian Classifiers , 1995, AISTATS.

[22]  Henry Tirri,et al.  On Discriminative Bayesian Network Classifiers and Logistic Regression , 2005, Machine Learning.

[23]  Bin Shen,et al.  Structural Extension to Logistic Regression: Discriminative Parameter Learning of Belief Net Classifiers , 2002, Machine Learning.

[24]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[26]  C. K. Chow,et al.  Approximating discrete probability distributions with dependence trees , 1968, IEEE Trans. Inf. Theory.

[27]  David Maxwell Chickering,et al.  Learning Equivalence Classes of Bayesian Network Structures , 1996, UAI.

[28]  Rafael Rumí,et al.  Learning hybrid Bayesian networks using mixtures of truncated exponentials , 2006, Int. J. Approx. Reason..

[29]  Mehran Sahami,et al.  Learning Limited Dependence Bayesian Classifiers , 1996, KDD.

[31]  Richard O. Duda,et al.  Pattern classification and scene analysis , 1974, A Wiley-Interscience publication.

[32]  N. A. Diamantidis,et al.  Unsupervised stratification of cross-validation for accuracy estimation , 2000, Artif. Intell..

[34]  Pedro M. Domingos,et al.  On the Optimality of the Simple Bayesian Classifier under Zero-One Loss , 1997, Machine Learning.

[36]  Usama M. Fayyad,et al.  Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning , 1993, IJCAI.

[38]  Clifford Stein,et al.  Introduction to Algorithms, 2nd edition. , 2001 .

[39]  Geoffrey J. McLachlan,et al.  Finite Mixture Models , 2000 , Wiley.

[40]  David Heckerman,et al.  Learning Gaussian Networks , 1994, UAI.

[41]  Ron Kohavi,et al.  Improving simple Bayes , 1997 .

[42]  Pat Langley,et al.  Estimating Continuous Distributions in Bayesian Classifiers , 1995, UAI.

[43]  L. Ryd,et al.  On bias. , 1994, Acta orthopaedica Scandinavica.

[44]  Susanne G. Bøttcher  Learning Bayesian networks with mixed variables , 2001, AISTATS.

[45]  Patrick Brézillon,et al.  Lecture Notes in Artificial Intelligence , 1999 .

[46]  David E. Goldberg  Genetic Algorithms in Search, Optimization and Machine Learning , 1989 .

[47]  G. Casella,et al.  Statistical Inference , 2002 , Duxbury.

[48]  Šarūnas Raudys On the effectiveness of Parzen window classifier , 1991 .

[49]  Marvin Minsky  Steps toward Artificial Intelligence , 1961, Proceedings of the IRE.

[50]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[51]  Ron Kohavi,et al.  Bias Plus Variance Decomposition for Zero-One Loss Functions , 1996, ICML.

[52]  Janez Demsar,et al.  Statistical Comparisons of Classifiers over Multiple Data Sets , 2006, J. Mach. Learn. Res..

[53]  Geoffrey I. Webb,et al.  Discretization for naive-Bayes learning: managing discretization bias and variance , 2008, Machine Learning.

[55]  D. W. Scott,et al.  Multivariate Density Estimation: Theory, Practice and Visualization , 1992 .

[56]  M. Aladjem Projection pursuit mixture density estimation , 2005, IEEE Transactions on Signal Processing.

[57]  Serafín Moral,et al.  Estimating Mixtures of Truncated Exponentials from Data , 2002, Probabilistic Graphical Models.

[58]  Pedro Larrañaga,et al.  Discriminative Learning of Bayesian Network Classifiers via the TM Algorithm , 2005, ECSQARU.

[59]  Igor Kononenko,et al.  Semi-Naive Bayesian Classifier , 1991, EWSL.

[60]  David G. Stork,et al.  Pattern classification, 2nd Edition , 2000 .

[61]  Christopher M. Bishop Latent Variable Models , 1998, Learning in Graphical Models.

[62]  E. Parzen On Estimation of a Probability Density Function and Mode , 1962 .

[63]  Matthew P. Wand,et al.  Kernel Smoothing , 1995 .

[64]  L. Devroye The Equivalence of Weak, Strong and Complete Convergence in $L_1$ for Kernel Density Estimates , 1983 .

[65]  Tony Jebara,et al.  Machine learning: Discriminative and generative , 2006 .

[66]  N. Wermuth,et al.  Graphical Models for Associations between Variables, some of which are Qualitative and some Quantitative , 1989 .

[67]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[68]  Pedro Larrañaga,et al.  Supervised classification with conditional Gaussian networks: Increasing the structure complexity from naive Bayes , 2006, Int. J. Approx. Reason..

[69]  J. Simonoff Smoothing Methods in Statistics , 1998 .

[70]  Boaz Lerner,et al.  Rapid spline-based kernel density estimation for Bayesian networks , 2004 .

[71]  Gareth James,et al.  Variance and Bias for General Loss Functions , 2003, Machine Learning.

[72]  Jerome H. Friedman,et al.  On Bias, Variance, 0/1—Loss, and the Curse-of-Dimensionality , 2004, Data Mining and Knowledge Discovery.

[73]  Ron Kohavi,et al.  Supervised and Unsupervised Discretization of Continuous Features , 1995, ICML.

[74]  José M. N. Leitão,et al.  On Fitting Mixture Models , 1999, EMMCVPR.

[75]  Irène Gijbels,et al.  Practical bandwidth selection in deconvolution kernel density estimation , 2004, Comput. Stat. Data Anal..

[76]  J. A. Lozano,et al.  Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation , 2001 .

[77]  Ian Witten,et al.  Data Mining , 2000 .

[78]  Richard E. Neapolitan  Learning Bayesian Networks , 2004 , Prentice Hall.

[79]  Peter E. Hart,et al.  Nearest neighbor pattern classification , 1967, IEEE Trans. Inf. Theory.

[81]  Li Wei,et al.  M-kernel merging: towards density estimation over data streams , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[82]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[83]  M. Rosenblatt Remarks on Some Nonparametric Estimates of a Density Function , 1956 .

[84]  F. Rosenblatt  Principles of Neurodynamics , 1962 , Spartan Books.

[85]  Pedro Larrañaga,et al.  Estimation of Distribution Algorithms , 2002, Genetic Algorithms and Evolutionary Computation.

[86]  Elie Bienenstock,et al.  Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[87]  Pat Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[88]  Ron Kohavi,et al.  Wrappers for performance enhancement and oblivious decision graphs , 1995 .

[89]  D. Hand,et al.  Idiot's Bayes—Not So Stupid After All? , 2001 .

[90]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[91]  Susan A. Murphy,et al.  Monographs on statistics and applied probability , 1990 .

[92]  Ron Kohavi,et al.  A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection , 1995, IJCAI.

[93]  Boaz Lerner,et al.  Bayesian network classification using spline-approximated kernel density estimation , 2005, Pattern Recognit. Lett..

[94]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[95]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[96]  David W. Scott,et al.  From Kernels to Mixtures , 2001, Technometrics.

[97]  Boaz Lerner Bayesian fluorescence in situ hybridisation signal classification , 2004, Artif. Intell. Medicine.
