Efficiently supporting multiple similarity queries for mining in metric databases

Metric databases are databases where a metric distance function is defined for pairs of database objects. In such databases, similarity queries in the form of range queries or k-nearest neighbor queries are the most important queries. In traditional query processing, single queries are issued independently by different users. In many data mining applications, however, the database is typically explored by iteratively asking similarity queries for answers of previous similarity queries. In this paper, we introduce a generic scheme for such data mining algorithms and we investigate two orthogonal approaches, reducing I/O cost as well as CPU cost, to speed-up the processing of multiple similarity queries. The proposed techniques apply to any type of similarity query and to an implementation based on an index or using a sequential scan. Parallelization yields an additional impressive speed-up. An extensive performance evaluation confirms the efficiency of our approach.

[1]  Pavel Zezula,et al.  M-tree: An Efficient Access Method for Similarity Search in Metric Spaces , 1997, VLDB.

[2]  Hans-Peter Kriegel,et al.  The X-tree : An Index Structure for High-Dimensional Data , 2001, VLDB.

[3]  Christian Böhm,et al.  Independent quantization: an index compression technique for high-dimensional data spaces , 2000, Proceedings of 16th International Conference on Data Engineering (Cat. No.00CB37073).

[4]  Ulrich Bastian,et al.  The TYCHO Catalogue , 1997 .

[5]  K. Wakimoto,et al.  Efficient and Effective Querying by Image Content , 1994 .

[6]  Raymond T. Ng,et al.  Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining , 1996, IEEE Trans. Knowl. Data Eng..

[7]  Oliver Günther,et al.  Multidimensional access methods , 1998, CSUR.

[8]  Hanan Samet,et al.  The Design and Analysis of Spatial Data Structures , 1989 .

[9]  Hanan Samet,et al.  Ranking in Spatial Databases , 1995, SSD.

[10]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[11]  Christian Böhm,et al.  Fast parallel similarity search in multimedia databases , 1997, SIGMOD '97.

[12]  Hans-Peter Kriegel,et al.  The R*-tree: an efficient and robust access method for points and rectangles , 1990, SIGMOD '90.

[13]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[14]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[15]  Hans-Jörg Schek,et al.  A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces , 1998, VLDB.

[16]  Jaideep Srivastava,et al.  Web Mining: Pattern Discovery from World Wide Web Transactions , 1996 .

[17]  Bhavani M. Thuraisingham,et al.  Database Issues for Data Visualization: Supporting Interactive Database Explorartion , 1995, Workshop on Database Issues for Data Visualization.

[18]  Hans-Peter Kriegel,et al.  Algorithms for Characterization and Trend Detection in Spatial Databases , 1998, KDD.

[19]  Jiawei Han,et al.  Discovery of Spatial Association Rules in Geographic Information Databases , 1995, SSD.

[20]  C. V. Ramamoorthy,et al.  Knowledge and Data Engineering , 1989, IEEE Trans. Knowl. Data Eng..

[21]  Padhraic Smyth,et al.  From Data Mining to Knowledge Discovery: An Overview , 1996, Advances in Knowledge Discovery and Data Mining.

[22]  Hans-Peter Kriegel,et al.  Efficient User-Adaptable Similarity Search in Large Multimedia Databases , 1997, VLDB.

[23]  Christian Böhm,et al.  A cost model for nearest neighbor search in high-dimensional data space , 1997, PODS.