Semisupervised Multiclass Classification Problems With Scarcity of Labeled Data: A Theoretical Study

In recent years, the performance of semisupervised learning (SSL) has been investigated theoretically. However, most of this theoretical work has focused on binary classification problems. In this paper, we go a step further by extending the work of Castelli and Cover to the multiclass paradigm. In particular, we consider the key SSL problem of classifying an unseen instance x into one of K different classes, using a training data set sampled from a mixture density distribution and composed of l labeled records and u unlabeled examples. Even when the mixture is identifiable and infinitely many unlabeled examples are available, labeled records are still needed to determine the K decision regions. Therefore, we first investigate the minimum number of labeled examples needed to accomplish that task. Then, we propose an optimal multiclass learning algorithm, which generalizes the optimal procedure proposed in the literature for binary problems. Finally, we use this generalization to study the probability of error when the binary class constraint is relaxed.
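
The setting described above can be illustrated with a small sketch. It is not the algorithm proposed in the paper; it only shows why labels are still needed once the mixture is known. The assumptions are ours: the K components are one-dimensional Gaussians whose weights, means, and standard deviations are taken as already recovered from the unlabeled data, and the function names (gaussian_logpdf, assign_labels, classify) are hypothetical. The labeled records are used only to decide which class name belongs to which component, here by brute-force search over all K! permutations.

```python
import itertools
import math


def gaussian_logpdf(x, mean, std):
    """Log-density of a 1-D Gaussian (an assumed component family)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def assign_labels(components, labeled):
    """Match class labels to known mixture components.

    components: list of (weight, mean, std), assumed known exactly.
    labeled:    list of (x, y) pairs with y in 0..K-1.
    Returns a tuple perm where perm[c] is the component assigned to class c.
    """
    K = len(components)
    # score[c][k]: joint log-likelihood of class-c labeled points under component k
    score = [[0.0] * K for _ in range(K)]
    for x, y in labeled:
        for k, (w, m, s) in enumerate(components):
            score[y][k] += math.log(w) + gaussian_logpdf(x, m, s)
    # Exhaustive search over the K! label-to-component matchings
    return max(
        itertools.permutations(range(K)),
        key=lambda p: sum(score[c][p[c]] for c in range(K)),
    )


def classify(x, components, perm):
    """Pick the most probable component for x, then report its class label."""
    best_k = max(
        range(len(components)),
        key=lambda k: math.log(components[k][0])
        + gaussian_logpdf(x, components[k][1], components[k][2]),
    )
    return perm.index(best_k)


if __name__ == "__main__":
    comps = [(0.3, -4.0, 1.0), (0.4, 0.0, 1.0), (0.3, 5.0, 1.0)]
    labeled = [(4.6, 0), (-3.8, 2)]  # no labeled point for class 1
    perm = assign_labels(comps, labeled)
    print(perm, classify(0.2, comps, perm))
```

In this toy example, two labeled points are enough to name all three well-separated components, because the remaining class is fixed by elimination. The exhaustive search is practical only for small K; for larger K the same matching can be posed as a linear assignment problem and handed to a solver such as the Hungarian method.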
