The IOC algorithm: efficient many-class non-parametric classification for high-dimensional data

This paper is about a variant of k-nearest-neighbor (k-NN) classification on large, many-class, high-dimensional datasets.

k-NN remains a popular classification technique, especially in areas such as computer vision, drug activity prediction and astrophysics. Furthermore, many more modern classifiers, such as kernel-based Bayes classifiers or the prediction phase of SVMs, require computational regimes similar to k-NN. We believe that tractable k-NN algorithms therefore continue to be important.

This paper relies on the insight that even with many classes, the task of finding the majority class among the k nearest neighbors of a query need not require us to explicitly find those k nearest neighbors. This insight was previously used in (Liu et al., 2003) in two algorithms called KNS2 and KNS3, which dealt with fast classification in the case of two classes. In this paper we show how a different approach, IOC (standing for the International Olympic Committee), can apply to the case of n classes where n > 2.

IOC assumes a slightly different processing of the datapoints in the neighborhood of the query. This allows it to search a set of metric trees, one for each class. During the searches it is possible to quickly prune away classes that cannot possibly be the majority.

We give experimental results on datasets of up to 5.8 × 10^5 records and 1.5 × 10^3 attributes, frequently showing an order of magnitude acceleration compared with each of (i) conventional linear scan, (ii) a well-known independent SR-tree implementation of conventional k-NN and (iii) a highly optimized conventional k-NN metric tree search.
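The central observation, that the majority class can be settled without enumerating all k neighbors, can be illustrated with a small Python sketch. This is not the IOC algorithm and it uses no metric trees: it contrasts a conventional linear-scan vote with a toy per-class search that stops as soon as no other class can catch the leader. The function names, the sorted per-class lists standing in for per-class indexes, and the stopping rule are all assumptions made for illustration, and the sketch assumes a non-empty dataset with k >= 1.

```python
import heapq
import math
from collections import Counter, defaultdict
from itertools import islice


def knn_majority_linear_scan(points, labels, query, k):
    """Baseline: rank every point by distance, keep the k nearest, then vote."""
    dists = [(math.dist(p, query), y) for p, y in zip(points, labels)]
    nearest = heapq.nsmallest(k, dists, key=lambda t: t[0])
    return Counter(y for _, y in nearest).most_common(1)[0][0]


def knn_majority_early_stop(points, labels, query, k):
    """Toy per-class search that stops once the vote can no longer change.

    Returns (winning_label, neighbors_examined).
    """
    # One sorted distance list per class: a crude, hypothetical stand-in
    # for a per-class index structure.
    per_class = defaultdict(list)
    for p, y in zip(points, labels):
        per_class[y].append((math.dist(p, query), y))
    streams = [sorted(v, key=lambda t: t[0]) for v in per_class.values()]

    budget = min(k, len(points))
    votes = Counter()
    # Lazily merge the per-class lists into one nearest-first stream.
    merged = heapq.merge(*streams, key=lambda t: t[0])
    for seen, (_, y) in enumerate(islice(merged, budget), start=1):
        votes[y] += 1
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # Even if every unexamined neighbor went to the runner-up, it could
        # not overtake the leader, so the remaining neighbors are irrelevant.
        if ranked[0][1] - runner_up > budget - seen:
            return ranked[0][0], seen
    return votes.most_common(1)[0][0], budget
```

With k = 5 and a query whose three nearest points share one label, the second function returns that label after examining only three neighbors, since the two remaining votes cannot change the outcome. The sketch still computes every distance up front; avoiding that work is what a tree-based search is for.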

[1]  J. Hammersley.  The Distribution of Distance in a Hypersphere , 1950.

[2]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1976, TOMS.

[3]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985.

[4]  David W. Aha,et al.  A study of instance-based algorithms for supervised learning tasks: mathematical, empirical, and psychological evaluations , 1990.

[5]  F. Frances Yao,et al.  Computational Geometry , 1991, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity.

[6]  Stephen M. Omohundro,et al.  Bumptrees for Efficient Function, Constraint and Classification Learning , 1990, NIPS.

[7]  Jeffrey K. Uhlmann,et al.  Satisfying General Proximity/Similarity Queries with Metric Trees , 1991, Inf. Process. Lett.

[8]  David B. Skalak,et al.  Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms , 1994, ICML.

[9]  Shin'ichi Satoh,et al.  The SR-tree: an index structure for high-dimensional nearest neighbor queries , 1997, SIGMOD '97.

[10]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[11]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching in fixed dimensions , 1998, JACM.

[12]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[13]  Piotr Indyk,et al.  Similarity Search in High Dimensions via Hashing , 1999, VLDB.

[14]  Andrew W. Moore,et al.  The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data , 2000, UAI.

[15]  Lin Chuang.  Extended Interval Temporal Logic for Undetermined Interval: Modeling and Linear Inference Using Time Petri Nets , 2001.

[16]  Liu Ting.  The Inference Engine of Extended Interval Temporal Logic , 2002.

[17]  Thomas G. Dietterich,et al.  Editors. Advances in Neural Information Processing Systems , 2002.

[18]  Liu Wei.  Linear Temporal Inference of Workflow Management System Based on Timed Petri Net Models , 2002.

[19]  Andrew W. Moore,et al.  New Algorithms for Efficient High-Dimensional Nonparametric Classification , 2006, J. Mach. Learn. Res.

[20]  Alexander G. Gray,et al.  Efficient exact k-NN and nonparametric classification in high dimensions , 2003, NIPS 2003.

[21]  Yanjun Qi,et al.  Supervised classification for video shot segmentation , 2003, 2003 International Conference on Multimedia and Expo. ICME '03. Proceedings (Cat. No.03TH8698).

[22]  Andrew W. Moore,et al.  An Investigation of Practical Approximate Nearest Neighbor Algorithms , 2004, NIPS.

[23]  D. Kibler,et al.  Instance-based learning algorithms , 2004, Machine Learning.

[24]  Chuang Lin,et al.  Modeling and Inference of Extended Interval Temporal Logic for Nondeterministic Intervals , 2005, IEEE Trans. Syst. Man Cybern. Part A.