Exploring the Role of Attention in Modeling Embodied Language Acquisition

Language is built on symbols, and those symbols must be learned during infant development. Recently, there has been growing awareness of the essential role that inferences about speakers' referential intentions play in grounding those symbols. Experiments have shown that these inferences are an important driving force in language learning from a relatively early age. The challenge ahead is to develop formal models of language acquisition that can shed light on the leverage provided by embodiment and attention. This paper describes a computational model of embodied language acquisition that simulates some of the formative steps in infant language acquisition. The novelty of our work is that the model shares multisensory information with a real agent from a first-person perspective, and eye gaze is used as a deictic reference to spot temporal correlations between different modalities. As a result, the system builds meaningful semantic representations that are grounded in the physical world. We test the model's ability to associate spoken names of objects with their visually grounded meanings and compare the results of our approach with a baseline that does not use referential intentions.
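To make the gaze-driven association idea concrete, below is a minimal sketch of one way to link spoken word tokens to the object a learner is fixating on: co-occurrence counts accumulated within gaze-selected attentional windows, normalized into conditional association scores. This is an illustrative simplification, not the paper's actual estimation procedure, and all function and variable names are hypothetical.

```python
from collections import defaultdict

# Minimal sketch (not the paper's implementation): associate spoken word
# tokens with visual object labels by counting co-occurrences inside
# gaze-selected attentional windows, then normalizing the counts into
# conditional association scores. All names here are illustrative.

def learn_associations(utterances, gaze_objects):
    """
    utterances:   list of word-token lists, one per attentional window
    gaze_objects: list of object labels, the object fixated during each window
    Returns: dict word -> dict object -> P(object | word), estimated by counts.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for words, obj in zip(utterances, gaze_objects):
        for w in words:
            counts[w][obj] += 1  # co-occurrence within the same gaze window

    associations = {}
    for w, obj_counts in counts.items():
        total = sum(obj_counts.values())
        associations[w] = {o: c / total for o, c in obj_counts.items()}
    return associations


if __name__ == "__main__":
    # Toy example: caregiver utterances paired with the object fixated
    # during each utterance (the deictic cue provided by eye gaze).
    utterances = [["look", "at", "the", "cup"],
                  ["the", "cup", "is", "red"],
                  ["see", "the", "ball"]]
    gaze_objects = ["cup", "cup", "ball"]
    model = learn_associations(utterances, gaze_objects)
    print(model.get("cup"))   # strongly associated with the cup object
    print(model.get("the"))   # spread across objects, weaker association
```

Even this toy version shows why the deictic cue matters: without the gaze-selected object labels, the learner would have to consider every object in the scene for every word, and frequent function words would dilute the associations far more.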
