Learning Bayesian classifiers from positive and unlabeled examples

The positive unlabeled learning term refers to the binary classification problem in the absence of negative examples. When only positive and unlabeled instances are available, semi-supervised classification algorithms cannot be directly applied, and thus new algorithms are required. One of these positive unlabeled learning algorithms is the positive naive Bayes (PNB), which is an adaptation of the naive Bayes induction algorithm that does not require negative instances. In this work we propose two ways of enhancing this algorithm. On one hand, we have taken the concept behind PNB one step further, proposing a procedure to build more complex Bayesian classifiers in the absence of negative instances. We present a new algorithm (named positive tree augmented naive Bayes, PTAN) to obtain tree augmented naive Bayes models in the positive unlabeled domain. On the other hand, we propose a new Bayesian approach to deal with the a priori probability of the positive class that models the uncertainty over this parameter by means of a Beta distribution. This approach is applied to both PNB and PTAN, resulting in two new algorithms. The four algorithms are empirically compared in positive unlabeled learning problems based on real and synthetic databases. The results obtained in these comparisons suggest that, when the predicting variables are not conditionally independent given the class, the extension of PNB to more complex networks increases the classification performance. They also show that our Bayesian approach to the a priori probability of the positive class can improve the results obtained by PNB and PTAN.

[1]  Catherine Blake,et al.  UCI Repository of machine learning databases , 1998 .

[2]  Philip S. Yu,et al.  Partially Supervised Classification of Text Documents , 2002, ICML.

[3]  Mehran Sahami,et al.  Learning Limited Dependence Bayesian Classifiers , 1996, KDD.

[4]  David M. J. Tax,et al.  One-class classification , 2001 .

[5]  Jiawei Han,et al.  Text classification from positive and unlabeled documents , 2003, CIKM '03.

[6]  Chris H. Q. Ding,et al.  PSoL: a positive sample only learning algorithm for finding non-coding RNA genes , 2006, Bioinform..

[7]  Robert Castelo,et al.  Splice site identification by idlBNs , 2004, ISMB/ECCB.

[8]  Marvin Minsky,et al.  Steps toward Artificial Intelligence , 1995, Proceedings of the IRE.

[9]  大西 仁,et al.  Pearl, J. (1988, second printing 1991). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan-Kaufmann. , 1994 .

[10]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[11]  Philip S. Yu,et al.  Building text classifiers using positive and unlabeled examples , 2003, Third IEEE International Conference on Data Mining.

[12]  Nasser M. Nasrabadi,et al.  Pattern Recognition and Machine Learning , 2006, Technometrics.

[13]  Xiaoli Li,et al.  Learning to Classify Texts Using Positive and Unlabeled Data , 2003, IJCAI.

[14]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[15]  Pedro Larrañaga,et al.  A partially supervised classification approach to dominant and recessive human disease gene prediction , 2007, Comput. Methods Programs Biomed..

[16]  David G. Stork,et al.  Pattern Classification , 1973 .

[17]  Rémi Gilleron,et al.  Text Classification from Positive and Unlabeled Examples , 2002 .

[18]  Nir Friedman,et al.  Bayesian Network Classifiers , 1997, Machine Learning.

[19]  Robert P. W. Duin,et al.  Uniform Object Generation for Optimizing One-class Classifiers , 2002, J. Mach. Learn. Res..