
Grounding Language in Descriptions of Scenes

Paul Williams (pwilly@cs.utexas.edu)
Department of Computer Sciences, The University of Texas at Austin
1 University Station C0500, Austin, Texas 78712 USA

Risto Miikkulainen (risto@cs.utexas.edu)
Department of Computer Sciences, The University of Texas at Austin
1 University Station C0500, Austin, Texas 78712 USA

Abstract

The problem of how abstract symbols, such as those in systems of natural language, may be grounded in perceptual information presents a significant challenge to several areas of research. This paper presents the GLIDES model, a neural network architecture that shows how this symbol-grounding problem can be solved through learned relationships between simple visual scenes and linguistic descriptions. Unlike previous models of symbol grounding, the model's learning is completely unsupervised, utilizing the principles of self-organization and Hebbian learning and allowing direct visualization of how concepts are formed and grounding occurs. Two sets of experiments were conducted to evaluate the model. In the first set, linguistic test stimuli were presented and the scenes generated by the model were evaluated as the grounding of the language. In the second set, the model was presented with visual test samples and its language generation capabilities based on the grounded representations were assessed. The results demonstrate that symbols can be grounded based on associations of perceptual and linguistic representations, and that the grounding can be made transparent. This transparency leads to unique insights into symbol grounding, including how many-to-many mappings between symbols and referents can be maintained and how concepts can be formed from cooccurrence relationships.

Introduction

In order to create an intelligent symbol system, symbols must be grounded in perceptual information (Harnad, 1990; Barsalou, 1999). Regardless of how intelligent the behavior of a system seems, if its symbols depend on external interpretation to attain meaning, then it cannot be said to have achieved understanding. For understanding to occur, the symbols must have inherent meaning in terms of the system's experiences of the external world. In order to develop such a symbol system, it is therefore necessary to understand how symbols become grounded in their perceptual correlates (Cottrell, Bartell, & Haupt, 1990; Chalmers, 1992).

Technically, symbol grounding means establishing perceptual categories and associating these categories with abstract tokens. In order to do that, it is first necessary to determine the commonalities of all the external objects to which a symbol refers that are distinct from attributes of objects in other categories. This process involves emphasizing the differences between categories and minimizing the differences within categories, a process called "categorical perception" (Harnad, 1987). Once the boundaries of a category have been established, it can be associated with an abstract token, at which point symbol grounding has occurred. Successfully modeling this process computationally could allow symbols used by machines to have directly grounded meanings, as well as provide insight into how grounding may be accomplished by the human brain.

Artificial neural network (ANN) architectures provide strong candidates for computational models of symbol grounding. Several such architectures have been proposed previously, as will be reviewed in the following section. While these models have provided many insights, the problem is by no means solved. First, most of these models are based on supervised learning, utilizing corrective feedback. Assuming that symbol grounding is a developmental cognitive process, it is unclear what the source of the error signals might be. Second, the previous models are often opaque, i.e., difficult to interpret. A model of symbol grounding should ideally do more than simply show that grounding can be achieved; it should demonstrate how the grounding occurs and what grounding looks like on a conceptual level.

This paper presents the GLIDES (Grounding Language in DEscriptions of Scenes) model, a neural network architecture that learns to ground linguistic descriptions in visual scenes. The model uses an unsupervised learning procedure based on self-organizing maps and Hebbian adaptation, learning associations between descriptions and scenes. It allows directly examining the representations and associations that are formed from the linguistic and visual inputs. The model therefore provides a unique framework for studying the grounding task.
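To make this mechanism concrete, the following Python sketch (not part of the original paper's materials) illustrates the general idea of two self-organizing maps linked by Hebbian associations: one map organizes visual feature vectors, the other organizes linguistic vectors, and co-occurring winners are linked so that a description can later retrieve a visual prototype, and vice versa. The map sizes, input dimensionalities, learning rates, and the helper names SOM, train_pair, ground, and describe are illustrative assumptions, not details of the GLIDES implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


class SOM:
    """A small 2-D self-organizing map (toy version for illustration)."""

    def __init__(self, rows, cols, dim, lr=0.2, sigma=1.5):
        self.weights = rng.random((rows * cols, dim))
        # Grid coordinates of each unit, used by the neighborhood function.
        self.coords = np.array(
            [(r, c) for r in range(rows) for c in range(cols)], dtype=float
        )
        self.lr, self.sigma = lr, sigma

    def winner(self, x):
        # Best-matching unit: smallest Euclidean distance to the input.
        return int(np.argmin(np.linalg.norm(self.weights - x, axis=1)))

    def train_step(self, x):
        w = self.winner(x)
        # Gaussian neighborhood around the winner on the map grid.
        d = np.linalg.norm(self.coords - self.coords[w], axis=1)
        h = np.exp(-(d ** 2) / (2 * self.sigma ** 2))
        self.weights += self.lr * h[:, None] * (x - self.weights)
        return w


# Toy input encodings (assumptions for this sketch): a scene as an
# 8-dimensional feature vector, a description as a 12-dimensional
# bag-of-words vector.
visual_som = SOM(5, 5, dim=8)
lang_som = SOM(5, 5, dim=12)

# Hebbian association matrix between linguistic and visual map units.
assoc = np.zeros((lang_som.weights.shape[0], visual_som.weights.shape[0]))
HEBB_LR = 0.1


def train_pair(scene_vec, desc_vec):
    """Unsupervised update on one co-occurring (scene, description) pair."""
    vis_idx = visual_som.train_step(scene_vec)
    lang_idx = lang_som.train_step(desc_vec)
    # Strengthen the link between the two co-active winners (Hebbian rule).
    assoc[lang_idx, vis_idx] += HEBB_LR


def ground(desc_vec):
    """Retrieve the visual prototype most strongly associated with a
    description (language -> scene direction)."""
    lang_idx = lang_som.winner(desc_vec)
    vis_idx = int(np.argmax(assoc[lang_idx]))
    return visual_som.weights[vis_idx]


def describe(scene_vec):
    """Retrieve the linguistic prototype most strongly associated with a
    scene (scene -> language direction)."""
    vis_idx = visual_som.winner(scene_vec)
    lang_idx = int(np.argmax(assoc[:, vis_idx]))
    return lang_som.weights[lang_idx]


# Train on random co-occurring pairs (stand-ins for scene/description data),
# then query both directions.
for _ in range(200):
    train_pair(rng.random(8), rng.random(12))
print(ground(rng.random(12)))
print(describe(rng.random(8)))
```

In this reading, grounding a description amounts to following the strongest learned association from its best-matching linguistic unit to a visual map unit and taking that unit's weight vector as a prototype scene; both maps and the associations are formed without any corrective feedback, only from co-occurrence.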
GLIDES was evaluated in two sets of experiments. The first set assesses the model's symbol grounding by evaluating the scenes it generates for linguistic test inputs of three types: (1) single words/concepts, (2) complex descriptions present in the training set, and (3) complex novel descriptions. The second set validates the model's grounding by evaluating its language generation ability when describing visual samples from two test sets: (1) scenes from the training set and (2) novel scenes. The results demonstrate unique insights into symbol grounding, including how many-to-many mappings between symbols and referents can be maintained and how concepts can be formed from cooccurrence relationships.

Prior Grounding Research

An early solution to the symbol-grounding problem was proposed by Harnad (1993), a combined connectionist/symbolic model trained by supervised learning. Similar models have been used in several studies since, successfully demonstrating the strength of connectionist learning in the grounding task (Cangelosi, Greco, & Harnad, 2000; Riga, Cangelosi, &