New Algorithms for Efficient High-Dimensional Nonparametric Classification

This paper is about non-approximate acceleration of high-dimensional nonparametric operations such as k-nearest-neighbor classifiers. We attempt to exploit the fact that, even if we want exact answers to nonparametric queries, we usually do not need to explicitly find the data points close to the query; we merely need to answer questions about the properties of that set of data points. This offers a small amount of computational leeway, and we investigate how far that leeway can be exploited. The approach is applicable to many algorithms in nonparametric statistics, memory-based learning, and kernel-based learning, but for clarity this paper concentrates on pure k-NN classification. We introduce new ball-tree algorithms that, on real-world data sets, give accelerations from 2-fold to 100-fold compared against highly optimized traditional ball-tree-based k-NN. These results include data sets with up to 10^6 dimensions and 10^5 records, and demonstrate non-trivial speed-ups.
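
The leeway described above can be illustrated with a toy case. The following is a minimal sketch, not the paper's ball-tree algorithms: it assumes a binary 1-NN classifier, Euclidean distance, and a single bounding ball around the negative class. The function names (`bounding_ball`, `one_nn_is_positive`) are hypothetical and chosen only for this illustration. The point is that the exact label can sometimes be decided from a triangle-inequality bound, without ever identifying which point is the nearest neighbor.

```python
import math


def bounding_ball(points):
    """Return (center, radius) of a centroid-centred ball containing all points."""
    dim = len(points[0])
    center = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    radius = max(math.dist(center, p) for p in points)
    return center, radius


def one_nn_is_positive(query, positives, negatives):
    """Exact 1-NN label test that tries to avoid scanning the negatives.

    Returns (is_positive, how), where how is "pruned" if the bound alone
    settled the answer and "scanned" if a full scan was needed.
    Ties are resolved in favour of the negative class (an arbitrary choice).
    """
    neg_center, neg_radius = bounding_ball(negatives)
    # Triangle inequality: no negative point can be closer to the query than this.
    neg_lower_bound = max(0.0, math.dist(query, neg_center) - neg_radius)

    best_pos = math.inf
    for p in positives:
        best_pos = min(best_pos, math.dist(query, p))
        if best_pos < neg_lower_bound:
            # Some positive beats every negative, so the nearest neighbour
            # is positive; we never learn which positive is the nearest.
            return True, "pruned"

    # The bound was inconclusive: fall back to an exhaustive comparison.
    best_neg = min(math.dist(query, n) for n in negatives)
    return best_pos < best_neg, "scanned"


positives = [(0.0, 0.0), (1.0, 0.0)]
negatives = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]
far_result = one_nn_is_positive((0.5, 0.2), positives, negatives)
near_result = one_nn_is_positive((10.0, 10.2), positives, negatives)
```

In this sketch the first query is answered after a single distance computation to a positive point, while the second query lies inside the negative ball, so the bound is zero and the code falls back to a full scan. Both answers are exact; only the amount of work differs.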
