Mining Concept-Drifting Data Streams Containing Labeled and Unlabeled Instances

Recently, mining data streams has attracted significant attention and has been considered as a challenging task in supervised classification. Most of the existing methods dealing with this problem assume the availability of entirely labeled data streams. Unfortunately, such assumption is often violated in real-world applications given that obtaining labels is a time-consuming and expensive task, while a large amount of unlabeled instances are readily available. In this paper, we propose a new approach for handling concept-drifting data streams containing labeled and unlabeled instances. First, we use KL divergence and bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any occurs, a new classifier is learned using the EM algorithm; otherwise, the current classifier is kept unchanged. Our approach is general so that it can be applied with different classification models. Experiments performed with naive Bayes and logistic regression, on two benchmark datasets, show the good performance of our approach using only limited amounts of labeled instances.

[1]  Shai Ben-David,et al.  Detecting Change in Data Streams , 2004, VLDB.

[2]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[3]  Abraham Bernstein,et al.  Entropy-based Concept Shift Detection , 2006, Sixth International Conference on Data Mining (ICDM'06).

[4]  Ricard Gavaldà,et al.  Learning from Time-Changing Data with Adaptive Windowing , 2007, SDM.

[5]  Marvin Minsky,et al.  Steps toward Artificial Intelligence , 1995, Proceedings of the IRE.

[6]  R. A. Leibler,et al.  On Information and Sufficiency , 1951 .

[7]  Bhavani M. Thuraisingham,et al.  A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[8]  Philip S. Yu,et al.  Classifying Data Streams with Skewed Class Distributions and Concept Drifts , 2008, IEEE Internet Computing.

[9]  Marcus A. Maloof,et al.  Dynamic Weighted Majority: An Ensemble Method for Drifting Concepts , 2007, J. Mach. Learn. Res..

[10]  D. Hosmer,et al.  Applied Logistic Regression , 1991 .

[11]  João Gama,et al.  Learning with Local Drift Detection , 2006, ADMA.

[12]  Mykola Pechenizkiy,et al.  Dynamic integration of classifiers for handling concept drift , 2008, Inf. Fusion.

[13]  S. Venkatasubramanian,et al.  An Information-Theoretic Approach to Detecting Changes in Multi-Dimensional Data Streams , 2006 .

[14]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[15]  Xiaojin Zhu,et al.  Semi-Supervised Learning Literature Survey , 2005 .

[16]  Ralf Klinkenberg,et al.  Learning drifting concepts: Example selection vs. example weighting , 2004, Intell. Data Anal..

[17]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[18]  Gerhard Widmer,et al.  Learning in the Presence of Concept Drift and Hidden Contexts , 1996, Machine Learning.

[19]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[20]  Robert Tibshirani,et al.  Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy , 1986 .

[21]  Ralf Klinkenberg,et al.  Using Labeled and Unlabeled Data to Learn Drifting Concepts , 2007 .