'N-Body' Problems in Statistical Learning

We present efficient algorithms for all-point-pairs problems, or 'N-body'- like problems, which are ubiquitous in statistical learning. We focus on six examples, including nearest-neighbor classification, kernel density estimation, outlier detection, and the two-point correlation. These include any problem which abstractly requires a comparison of each of the N points in a dataset with each other point and would naively be solved using N2 distance computations. In practice N is often large enough to make this infeasible. We present a suite of new geometric techniques which are applicable in principle to any 'N-body' computation including large-scale mixtures of Gaussians, RBF neural networks, and HMM's. Our algorithms exhibit favorable asymptotic scaling and are empirically several orders of magnitude faster than the naive computation, even for small datasets. We are aware of no exact algorithms for these problems which are more efficient either empirically or theoretically. In addition, our framework yields simple and elegant algorithms. It also permits two important generalizations beyond the standard all-point-pairs problems, which are more difficult. These are represented by our final examples, the multiple two-point correlation and the notorious n-point correlation.

[1]  Donald E. Knuth,et al.  Sorting and Searching , 1973 .

[2]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1976, TOMS.

[3]  Piet Hut,et al.  A hierarchical O(N log N) force-calculation algorithm , 1986, Nature.

[4]  Leslie Greengard,et al.  A fast algorithm for particle simulations , 1987 .

[5]  F. Frances Yao,et al.  Computational Geometry , 1991, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity.

[6]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[7]  Andrew W. Moore,et al.  Multiresolution Instance-Based Learning , 1995, IJCAI.

[8]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[9]  Mary S. Lee Cached Suucient Statistics for Eecient Machine Learning with Large Datasets 1. Caching Suucient Statistics , 1997 .

[10]  Andrew W. Moore,et al.  Efficient Locally Weighted Polynomial Regression Predictions , 1997, ICML.

[11]  I. Szapudi A New Method for Calculating Counts in Cells , 1997, astro-ph/9711221.

[12]  Andrew W. Moore,et al.  Very Fast EM-Based Mixture Model Clustering Using Multiresolution Kd-Trees , 1998, NIPS.

[13]  Andrew W. Moore,et al.  Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets , 1998, J. Artif. Intell. Res..

[14]  Andrew W. Moore,et al.  Accelerating exact k-means algorithms with geometric reasoning , 1999, KDD '99.

[15]  S. Colombi,et al.  Cosmic statistics of statistics , 1999, astro-ph/9912289.

[16]  Andrew W. Moore,et al.  The Anchors Hierarchy: Using the Triangle Inequality to Survive High Dimensional Data , 2000, UAI.