Emergent Gestural Scores in a Recurrent Neural Network Model of Vowel Harmony

In this paper, we present the results of neural network modeling of speech production. We introduce GestNet, a sequence-to-sequence, encoder-decoder neural network architecture in which strings of input symbols are translated into sequences of vocal tract articulator movements. We train our models to produce movements of lip and tongue body articulators consistent with a pattern of stepwise vowel height harmony. Though we provide our models with no linguistic structure, they reliably learn this harmony pattern. In addition, by probing these models we find evidence of emergent linguistic structure. Specifically, we examine patterns of encoder-decoder attention (the degree of influence of specific input segments on model outputs) and find that they resemble the patterns of gestural activation assumed within the Gestural Harmony Model, a model of harmony built upon the representations of Articulatory Phonology. This result is significant because it lends support to one of the central claims of the Gestural Harmony Model: that harmony is the result of harmony-triggering gestures extending to overlap the gestures of surrounding segments.
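To make the architecture described above concrete, the following is a minimal sketch of a sequence-to-sequence encoder-decoder with encoder-decoder attention that maps a string of segment symbols to a sequence of continuous articulator positions. This is not the authors' exact GestNet: the class name, the choice of GRUs, additive attention, all layer sizes, and the two-channel output (standing in for lip and tongue body articulators) are illustrative assumptions.

```python
# Illustrative sketch only (assumed architecture, not the paper's exact GestNet):
# a GRU encoder-decoder with additive attention that maps segment symbols to
# frames of continuous articulator positions. Attention weights are returned so
# they can be probed for gestural-activation-like patterns.
import torch
import torch.nn as nn


class GestNetSketch(nn.Module):
    def __init__(self, n_symbols, emb_dim=32, hid_dim=64, n_articulators=2):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRUCell(n_articulators + hid_dim, hid_dim)
        # Additive (Bahdanau-style) scoring of each encoder state against the decoder state.
        self.attn = nn.Linear(2 * hid_dim, 1)
        self.readout = nn.Linear(hid_dim, n_articulators)

    def forward(self, symbols, n_frames):
        # symbols: (batch, seq_len) integer segment indices
        enc_states, h = self.encoder(self.embed(symbols))        # (B, T, H)
        dec_h = h.squeeze(0)                                      # (B, H)
        prev_out = torch.zeros(symbols.size(0), self.readout.out_features)
        outputs, attn_weights = [], []
        for _ in range(n_frames):
            # Attend over input segments given the current decoder state.
            query = dec_h.unsqueeze(1).expand_as(enc_states)      # (B, T, H)
            scores = self.attn(torch.cat([enc_states, query], dim=-1))
            weights = torch.softmax(scores, dim=1)                # (B, T, 1)
            context = (weights * enc_states).sum(dim=1)           # (B, H)
            dec_h = self.decoder(torch.cat([prev_out, context], dim=-1), dec_h)
            prev_out = self.readout(dec_h)                        # articulator positions
            outputs.append(prev_out)
            attn_weights.append(weights.squeeze(-1))
        # outputs: (B, n_frames, n_articulators); attn_weights: (B, n_frames, T)
        return torch.stack(outputs, dim=1), torch.stack(attn_weights, dim=1)
```

In a setup like this, the model would be trained with a regression loss against articulator trajectories, and the per-frame attention weights over input segments could then be plotted and compared against the gestural activation intervals posited by the Gestural Harmony Model.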
