Active Learning for Anomaly and Rare-Category Detection

We introduce a novel active-learning scenario in which a user wants to work with a learning algorithm to identify useful anomalies. These are distinguished from the traditional statistical definition of anomalies as outliers or merely ill-modeled points: in our setting, the usefulness of an anomaly is judged subjectively by the user. We make two additional assumptions. First, the useful anomalies to be hunted down are extremely few within a massive dataset. Second, both useful and useless anomalies may sometimes cluster into tiny classes of similar anomalies. The challenge is thus to identify "rare category" records in an unlabeled noisy set with help (in the form of class labels) from a human expert who has a small budget of datapoints that they are prepared to categorize. We propose a technique to meet this challenge, which assumes a mixture model fit to the data but otherwise makes no assumptions about the particular form of the mixture components. This property promises wide applicability in real-life scenarios and for various statistical models. We give an overview of several alternative methods, highlighting their strengths and weaknesses, and conclude with a detailed empirical analysis. We show that our method can quickly zoom in on an anomaly set containing a few tens of points in a dataset of hundreds of thousands.
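The paper's actual selection criterion builds on the mixture model in a more principled way than this; purely as an illustration of the general setup (a mixture fit to unlabeled data, plus a small labeling budget spent on the points the model finds hardest to explain), the following is a minimal sketch. Everything here is our own assumption, not the paper's method: a two-component 1-D Gaussian mixture fit by plain EM, with the budget spent on the lowest-likelihood points.

```python
# Hedged sketch only: fit a 2-component 1-D Gaussian mixture by EM,
# then spend the expert's labeling budget on the least-likely points.
# The setup (1-D data, 2 components, lowest-likelihood querying) is an
# illustrative assumption, not the selection rule proposed in the paper.
import math
import random

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, n_iter=50):
    """Fit a 2-component 1-D Gaussian mixture with plain EM."""
    mu = [min(data), max(data)]      # crude initialization
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: per-point component responsibilities
        resp = []
        for x in data:
            p = [w[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means, variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, data)) / nk)
    return w, mu, var

def likelihood(x, w, mu, var):
    return sum(w[k] * gauss_pdf(x, mu[k], var[k]) for k in range(2))

def query_lowest_likelihood(data, budget):
    """Indices of the `budget` points the fitted mixture finds least likely."""
    w, mu, var = fit_gmm_1d(data)
    scores = [likelihood(x, w, mu, var) for x in data]
    return sorted(range(len(data)), key=lambda i: scores[i])[:budget]

random.seed(0)
# Bulk of the data: two dense clusters, plus a tiny rare "anomaly" class.
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(10.0, 1.0) for _ in range(200)]
data += [25.0, 25.3, 24.8]           # the rare category we hope to surface
picked = query_lowest_likelihood(data, budget=3)
print(sorted(picked))                # the three rare points, indices 400-402
```

In a real deployment the queried points would be shown to the human expert, their labels fed back, and the model refit; the paper's contribution is a smarter query rule in that loop, not the plain lowest-likelihood heuristic used above.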