Self organization of a massive document collection

This article describes the implementation of a system that organizes vast document collections according to textual similarities. It is based on the self-organizing map (SOM) algorithm. Statistical representations of the documents' vocabularies are used as the feature vectors. The main goal of our work has been to scale up the SOM algorithm so that it can handle large amounts of high-dimensional data. In a practical experiment we mapped 6,840,568 patent abstracts onto a 1,002,240-node SOM. The feature vectors in that experiment were 500-dimensional vectors of stochastic values, obtained as random projections of weighted word histograms.
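The random projection of a weighted word histogram can be illustrated with a minimal sketch. It is not the system's actual implementation: the abstract does not state how the projection matrix or the word weights are chosen, so the Gaussian matrix entries, the per-word seeding, and the names `weighted_histogram`, `random_column`, `project`, and `cosine` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def weighted_histogram(tokens, weights):
    """Sparse weighted word histogram: word -> count * weight.

    `weights` is a hypothetical word -> weight table; words missing
    from it get weight 1.0.
    """
    counts = Counter(tokens)
    return {word: n * weights.get(word, 1.0) for word, n in counts.items()}


def random_column(word, dim, seed=0):
    """Pseudo-random Gaussian column of the projection matrix for one word.

    Seeding the generator with the word makes the column reproducible
    without storing the full (vocabulary x dim) matrix. Gaussian entries
    are an assumption of this sketch.
    """
    rng = random.Random(f"{seed}:{word}")
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def project(histogram, dim=500, seed=0):
    """Project a sparse histogram onto `dim` random directions."""
    out = [0.0] * dim
    for word, value in histogram.items():
        column = random_column(word, dim, seed)
        for i in range(dim):
            out[i] += value * column[i]
    return out


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    weights = {}  # hypothetical: no word-specific weighting in this demo
    doc_a = weighted_histogram("map neural network map".split(), weights)
    doc_b = weighted_histogram("neural network map training".split(), weights)
    doc_c = weighted_histogram("patent chemical compound polymer".split(), weights)
    va, vb, vc = project(doc_a), project(doc_b), project(doc_c)
    print("similar docs:  ", cosine(va, vb))
    print("unrelated docs:", cosine(va, vc))
```

Documents that share vocabulary keep a high cosine similarity after projection, while documents with disjoint vocabularies end up nearly orthogonal, which is why the much shorter projected vectors can stand in for the full histograms when computing similarities.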
