Context‐dependent acoustic subword modeling for connected digit recognition

Accurate and robust connected digit recognition is essential for a wide range of telecommunication services. When a recognizer was trained and tested using only clean network digit data, with the same whole‐word model architecture as in the TI/NIST connected digit testing, the string error rate increased from less than 1% to more than 5%. Performance degraded even further when the recognizer was evaluated on data collected under different network conditions. Most of the observed errors were caused by changing channel characteristics, highly variable digit pronunciations, and inadequate modeling of cross‐digit coarticulation. Results are presented for a number of context‐dependent whole‐word and subword modeling techniques developed to overcome some of these problems. The most effective is a new acoustic subword modeling approach in which each digit model consists of three parts: head, body, and tail subword units. Multiple heads and tails are also allowed, one for each of the 11 possible preceding and ...
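
The head, body, and tail decomposition can be illustrated with a minimal sketch of how a digit string might be expanded into context‐dependent units. This is not the authors' implementation. It assumes that the 11 contexts are the 11 vocabulary words ("zero" through "nine" plus "oh"), that each digit has a single context‐independent body, and that string edges are marked with a hypothetical boundary symbol "#". The unit naming scheme and the function names are likewise invented for illustration.

```python
# Illustrative sketch only: unit names, the vocabulary list, and the "#"
# boundary marker are assumptions, not details taken from the abstract.

DIGITS = ["zero", "one", "two", "three", "four", "five",
          "six", "seven", "eight", "nine", "oh"]  # assumed 11-word vocabulary
BOUNDARY = "#"  # hypothetical marker for the start or end of a string


def unit_inventory(words=DIGITS):
    """Enumerate head, body, and tail units for digit-to-digit contexts."""
    # One head per (digit, preceding digit) pair.
    heads = [f"head({w}|{p})" for w in words for p in words]
    # One context-independent body per digit (assumed).
    bodies = [f"body({w})" for w in words]
    # One tail per (digit, following digit) pair.
    tails = [f"tail({w}|{n})" for w in words for n in words]
    return heads, bodies, tails


def expand(digit_string):
    """Map a list of digit words to its head/body/tail unit sequence."""
    units = []
    for i, w in enumerate(digit_string):
        prev = digit_string[i - 1] if i > 0 else BOUNDARY
        nxt = digit_string[i + 1] if i + 1 < len(digit_string) else BOUNDARY
        units += [f"head({w}|{prev})", f"body({w})", f"tail({w}|{nxt})"]
    return units


if __name__ == "__main__":
    heads, bodies, tails = unit_inventory()
    print(len(heads), len(bodies), len(tails))
    print(expand(["one", "oh"]))
```

Under these assumptions the body of each digit is shared across all contexts, while the cross‐digit coarticulation is carried by the head and tail units selected by the neighboring digits.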