Self-Organization of Very Large Document Collections: State of the Art

The Self-Organizing Map (SOM) forms a nonlinear projection from a high-dimensional data manifold onto a low-dimensional grid. A representative model of some subset of data is associated with each grid point. The SOM algorithm computes an optimal collection of models that approximates the data in the sense of some error criterion and also takes into account the similarity relations of the models. The models then become ordered on the grid according to their similarity. When the SOM is used for the exploration of statistical data, the data vectors can be approximated by models of the same dimensionality. When mapping documents, one can represent them statistically by their word frequency histograms or some reduced representations of the histograms that can be regarded as data vectors. We have made SOMs of collections of over one million documents. Each document is mapped onto some grid point, with a link from this point to the document database. The documents are ordered on the grid according to their contents and neighboring documents can be browsed readily. Keywords or key texts can be used to search for the most relevant documents first. New effective coding and computing schemes of the mapping are described.