Mining massive document collections by the WEBSOM method

A viable alternative to traditional text-mining methods is WEBSOM, a software system based on the Self-Organizing Map (SOM) principle. Before any searching or browsing takes place, the method orders a collection of textual items, such as documents, according to their contents and maps them onto a regular two-dimensional array of map units. Documents whose overall contents are similar are mapped to the same or neighboring map units, and each unit holds links to the document database. A search can therefore start by locating the documents that best match the search expression, and further relevant results can then be found through the pointers stored at the same or neighboring map units, even if those documents did not match the search criterion exactly. This work gives an overview of the WEBSOM method and its performance and, as a special application, describes the WEBSOM map of the texts of the Encyclopaedia Britannica.
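
The lookup idea described above can be illustrated with a small sketch. It is not the WEBSOM implementation: it assumes a map that has already been ordered, and every name and number in it (MODEL_VECTORS, DOCUMENTS, the three toy term weights, the one-row grid, and the grid-distance neighborhood) is a hypothetical choice made for illustration. Each document is stored at its best-matching unit, and a query returns the documents at the query's best-matching unit followed by those at neighboring units.

```python
from math import dist

# Hypothetical, already-ordered map: grid position -> model vector.
# The three components are toy term weights; a real map would be
# produced by SOM training on document vectors, not written by hand.
# One row of a two-dimensional grid is used to keep the example short.
MODEL_VECTORS = {
    (0, 0): (1.0, 0.0, 0.0),
    (0, 1): (0.7, 0.7, 0.0),
    (0, 2): (0.0, 0.7, 0.7),
    (0, 3): (0.0, 0.0, 1.0),
}

# Hypothetical document vectors in the same three-term space.
DOCUMENTS = {
    "d1": (0.9, 0.1, 0.0),
    "d2": (0.6, 0.8, 0.0),
    "d3": (0.0, 0.6, 0.8),
    "d4": (0.0, 0.0, 1.0),
}


def best_matching_unit(vector):
    """Return the grid position whose model vector is closest to `vector`."""
    return min(MODEL_VECTORS, key=lambda unit: dist(MODEL_VECTORS[unit], vector))


def build_index(documents):
    """Store each document id at its best-matching unit (the unit's 'pointers')."""
    index = {unit: [] for unit in MODEL_VECTORS}
    for doc_id, vector in documents.items():
        index[best_matching_unit(vector)].append(doc_id)
    return index


def search(index, query_vector, radius=1):
    """Return ids stored within `radius` grid steps of the query's best unit,
    ordered by grid distance from that unit and then by document id."""
    row0, col0 = best_matching_unit(query_vector)
    hits = []
    for (row, col), doc_ids in index.items():
        grid_distance = max(abs(row - row0), abs(col - col0))
        if grid_distance <= radius:
            hits.extend((grid_distance, doc_id) for doc_id in doc_ids)
    return [doc_id for _, doc_id in sorted(hits)]


if __name__ == "__main__":
    index = build_index(DOCUMENTS)
    print(search(index, (1.0, 0.0, 0.0), radius=0))  # best-matching unit only
    print(search(index, (1.0, 0.0, 0.0), radius=1))  # plus neighboring units
```

With radius=0 the query for the first term returns only the document at the best-matching unit; widening to radius=1 also returns the document stored at the adjacent unit, which mirrors how neighboring map units can supply related results that the query did not match exactly.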
