Contextual Relations of Words in Grimm Tales, Analyzed by Self-Organizing Map

Semantic roles of words in natural languages are reflected by the contexts in which they occur. These roles can explicitly be visualized by the Self-Organizing Map (SOM). In the experiments reported in this work the source data consisted of the raw text of Grimm fairy tales without any prior syntactic or semantic categorization of the words. The algorithm was able to create diagrams that seem to comply reasonably well with the traditional syntactical categorizations and human intuition about the semantics of the words.

It has earlier been shown that the Self-Organizing Map (SOM) can be applied to the visualization of contextual roles of words, i.e., similarities in their usage in short contexts formed of adjacent words [4]. This paper demonstrates that such relations or roles are also statistically reflected in unrestricted, even quaint natural expressions. The source material chosen for this experiment consisted of 200 Grimm tales (English translation).

In most practical applications of the SOM, the input to the map algorithm is derived from some measurements, usually after their preprocessing. In such cases, the input vectors are supposed to have metric relations. Interpretation of languages, on the contrary, must be based on the processing of sequences of discrete symbols. If the words were encoded numerically, the ordered sets formed of them could also be compared mutually as well as with reference expressions. However, as no numerical value of the code should imply any order to the words themselves, it will be necessary to use uncorrelated vectors for encoding.

The simplest method to introduce uncorrelated codes is to assign a unit vector for each word. When all different words in the input material are listed, a code vector can be defined to have as many components as there are words in the list. This method, however, is only practicable in very small experiments.
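The unit-vector encoding can be illustrated with a minimal sketch. The function names `one_hot_codes` and `dot` and the toy sentence are hypothetical, chosen only for illustration; they are not taken from the experiments.

```python
def one_hot_codes(words):
    """Assign each distinct word a unit vector.

    The code dimensionality equals the vocabulary size, which is why
    this scheme only suits very small vocabularies.
    """
    vocab = sorted(set(words))
    n = len(vocab)
    return {
        w: [1.0 if i == j else 0.0 for j in range(n)]
        for i, w in enumerate(vocab)
    }


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy input (an assumption, not the Grimm corpus).
codes = one_hot_codes("the king and the queen".split())
```

Any two distinct words receive exactly orthogonal codes, so no ordering among the words is implied, but each code has as many components as the vocabulary has words.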
If the vocabulary is large, as in the present experiments, we may instead encode the words by quasi-orthogonal random vectors of a much smaller dimensionality [4]. To create a map of discrete symbols that occur within the sentences, each symbol must be presented in its due context. The context may consist of the immediate surroundings of the word in the text. Application of self-organizing maps to natural language processing has been described earlier in, e.g., [2], [3], [4], [5], and [6].
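The random encoding and the use of immediate surroundings as context can be sketched as follows. This is only an illustration under stated assumptions: the dimensionality of 50, the Gaussian sampling, the window of one word on each side, and the averaging of contexts over all occurrences of a word are choices made here, and the helper names `random_codes` and `average_contexts` are hypothetical.

```python
import math
import random


def random_codes(vocab, dim, seed=0):
    """Draw one random unit-length vector per word.

    With dim much smaller than the vocabulary size, the vectors are
    only approximately (quasi-) orthogonal.
    """
    rng = random.Random(seed)
    codes = {}
    for w in vocab:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        codes[w] = [x / norm for x in v]
    return codes


def average_contexts(tokens, codes, dim):
    """Average, per word, the concatenated codes of its two neighbours.

    Assumption: the context is the preceding and the following word.
    The first and last tokens are skipped as centre words.
    """
    sums, counts = {}, {}
    for i in range(1, len(tokens) - 1):
        # List concatenation gives a context vector of length 2 * dim.
        ctx = codes[tokens[i - 1]] + codes[tokens[i + 1]]
        acc = sums.setdefault(tokens[i], [0.0] * (2 * dim))
        for k, x in enumerate(ctx):
            acc[k] += x
        counts[tokens[i]] = counts.get(tokens[i], 0) + 1
    return {w: [x / counts[w] for x in acc] for w, acc in sums.items()}


# Toy input (an assumption, not the Grimm corpus).
DIM = 50
tokens = "the king saw the queen and the queen saw the king".split()
codes = random_codes(sorted(set(tokens)), DIM)
contexts = average_contexts(tokens, codes, DIM)
```

Vectors such as those in `contexts` would then serve as inputs to a map algorithm; words used in similar surroundings obtain similar averaged context vectors.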