A continuous training procedure for connected digit recognition

Algorithms for recognizing strings of connected words from whole-word patterns (either templates or statistical models) have advanced to the point of high efficiency and accuracy. Although the computational requirements of these connected word recognition algorithms remain high, advances in VLSI hardware make even the most ambitious connected word recognition tasks practical with today's technology. The greatest impediment to the successful use of connected word recognizers is the difficulty of extracting reliable, robust whole-word reference patterns. In the past, connected word recognizers have relied either on isolated word reference patterns (which are trivially obtained) or on reference patterns derived from limited-context strings of words (e.g., the middle digit from strings of 3 digits). The resulting whole-word reference patterns were adequate for slow rates of articulation but proved inadequate when users spoke strings of words at high rates (e.g., on the order of 200-300 words per minute). To alleviate this difficulty, a training procedure for extracting whole-word patterns from naturally spoken word strings has been implemented and is described here. The training procedure is essentially a k-means loop: a set of known word strings is segmented into individual words by matching against an initial set of word reference patterns (typically a speaker-independent set of isolated word reference patterns). The segmented words are then used to create an updated set of word reference patterns (via clustering methods for templates, or via statistical techniques for word models), which are fed back into the segmentation step to give an updated set of word tokens from the labelled training set. This procedure is iterated until a stable set of whole-word reference patterns is obtained. The training procedure was implemented and tested on a connected digit recognition task. For this task, string accuracies on the order of 98-99% were obtained on variable-length strings of 1 to 7 digits.
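
The segment-and-update loop described above can be illustrated with a deliberately simplified sketch. It is not the implementation evaluated here: as assumptions made only for illustration, each word's reference pattern is reduced to a single scalar mean, each speech frame is a single number, and segmentation is an exhaustive dynamic program over word boundaries rather than a template or word-model match. The names `segment` and `train`, and the toy data, are hypothetical.

```python
def segment(frames, words, means):
    """Split `frames` into len(words) contiguous, non-empty segments that
    minimize the summed squared distance of each frame to the mean of the
    word it is assigned to. Returns (start, end) pairs, end exclusive.
    Assumes len(frames) >= len(words)."""
    n, k = len(frames), len(words)
    inf = float("inf")
    # cost[j][t]: best cost of covering the first t frames with the first j words
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for j in range(1, k + 1):
        mu = means[words[j - 1]]
        for t in range(j, n + 1):
            seg = 0.0
            # grow word j's segment backwards from frame t-1 to candidate start s
            for s in range(t - 1, j - 2, -1):
                seg += (frames[s] - mu) ** 2
                c = cost[j - 1][s] + seg
                if c < cost[j][t]:
                    cost[j][t], back[j][t] = c, s
    # trace the best boundaries back from the end of the string
    bounds, t = [], n
    for j in range(k, 0, -1):
        s = back[j][t]
        bounds.append((s, t))
        t = s
    return bounds[::-1]


def train(strings, init_means, max_iter=20):
    """strings: list of (frames, words) pairs with known transcriptions.
    Alternates segmentation and pattern re-estimation until the
    segmentation stops changing (or max_iter is reached)."""
    means = dict(init_means)
    prev = None
    for _ in range(max_iter):
        segs = [segment(f, w, means) for f, w in strings]
        if segs == prev:  # stable segmentation: stop iterating
            break
        prev = segs
        # pool the frames assigned to each word across all training strings
        pools = {}
        for (frames, words), bounds in zip(strings, segs):
            for word, (s, t) in zip(words, bounds):
                pools.setdefault(word, []).extend(frames[s:t])
        # re-estimate each word's pattern (here: just the mean of its frames)
        for word, pool in pools.items():
            means[word] = sum(pool) / len(pool)
    return means, prev


# Toy labelled "connected digit" strings with 1-D frames (made-up data).
strings = [
    ([1.0, 1.2, 5.1, 4.9, 9.0], ["one", "five", "nine"]),
    ([4.8, 5.2, 0.9, 1.1], ["five", "one"]),
]
# Rough initial patterns, standing in for isolated-word references.
init = {"one": 0.0, "five": 4.0, "nine": 7.0}
means, segs = train(strings, init)
```

In this toy setting the rough initial patterns play the role of the isolated-word references, and the re-estimated means play the role of the patterns extracted from connected strings. A realistic system would replace the scalar mean with clustered templates or statistical word models and the boundary search with the corresponding connected word matching algorithm.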