A Computational Model of Embodied Language Learning

Language is about symbols, and those symbols must be grounded in the physical environment during human development. Recently, there has been increased awareness of the essential role that inferences of speakers' referential intentions play in grounding those symbols. Experiments have shown that these inferences, as revealed in eye, head and hand movements, serve as an important driving force in language learning at a relatively early age. The challenge ahead is to develop formal models of language acquisition that can shed light on the leverage provided by embodiment. We present an implemented computational model of embodied language acquisition that learns words from natural interactions with users. The system can be trained in an unsupervised mode in which users perform everyday tasks while providing natural language descriptions of their behaviors. We collect acoustic signals in concert with user-centric multisensory information from non-speech modalities, such as video from the user's perspective, gaze positions, head directions and hand movements. We develop a multimodal learning algorithm that first spots words in continuous speech and then associates action verbs and object names with their grounded meanings. The central idea is to use non-speech contextual information to facilitate word spotting, and to use the user's attention as a deictic reference for discovering temporal correlations among data from different modalities in order to build lexical items. We report the results of a series of experiments that demonstrate the effectiveness of our approach.
