Learning "Forgiving" Hash Functions: Algorithms and Large Scale Tests

The problem of efficiently finding similar items in a large corpus of high-dimensional data points arises in many real-world tasks, such as music, image, and video retrieval. Beyond the scaling difficulties that arise with lookups in large data sets, the complexity in these domains is exacerbated by an imprecise definition of similarity. In this paper, we describe a method to learn a similarity function from only weakly labeled positive examples. Once learned, this similarity function is used as the basis of a hash function that sharply limits the number of points considered for each lookup. In tests on a large real-world audio data set, only a tiny fraction of the points (∼0.27%) is ever considered for each lookup. For further efficiency, lookups require no comparisons in the original high-dimensional space of points. In both efficiency and accuracy, the method far surpasses a state-of-the-art technique based on Locality-Sensitive Hashing for the same problem and data set.
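To make the lookup idea concrete, the following is a minimal sketch of hash-based retrieval in which candidates are ranked purely by hash-bucket collisions, with no distance computed in the original space. It is not the method of the paper: the names `make_bits_fn` and `HashIndex` are hypothetical, splitting the bits into several tables and ranking by vote count are assumptions made for illustration, and the bit mapping is a stand-in (sign bits of fixed random projections) where the paper would use its learned function.

```python
import random
from collections import Counter, defaultdict


def make_bits_fn(dim, n_bits, seed=0):
    """Stand-in for a learned mapping (assumption): sign bits of
    fixed random projections. Any function vector -> bit tuple works."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

    def bits_fn(x):
        return tuple(
            int(sum(w * v for w, v in zip(plane, x)) >= 0) for plane in planes
        )

    return bits_fn


class HashIndex:
    """Hypothetical index: the bit string is cut into n_tables chunks,
    and each chunk is the key into its own hash table."""

    def __init__(self, bits_fn, n_tables):
        self.bits_fn = bits_fn
        self.n_tables = n_tables
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _keys(self, x):
        bits = self.bits_fn(x)
        step = len(bits) // self.n_tables
        return [bits[i * step:(i + 1) * step] for i in range(self.n_tables)]

    def add(self, item_id, x):
        for table, key in zip(self.tables, self._keys(x)):
            table[key].append(item_id)

    def lookup(self, x, top_k=5):
        # Only items sharing at least one bucket with the query are touched;
        # ranking uses the number of matching buckets, not a distance.
        votes = Counter()
        for table, key in zip(self.tables, self._keys(x)):
            votes.update(table.get(key, ()))
        return votes.most_common(top_k)


if __name__ == "__main__":
    rng = random.Random(1)
    points = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
    index = HashIndex(make_bits_fn(dim=16, n_bits=32), n_tables=4)
    for i, p in enumerate(points):
        index.add(i, p)
    print(index.lookup(points[7]))
```

A hash function is "forgiving" in this sketch only to the extent that similar inputs land in the same bucket in at least some tables; how well that holds depends entirely on the bit mapping, which is the part the paper learns.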
