Text line segmentation and word recognition in a system for general writer independent handwriting recognition

We present a system for recognizing unconstrained English handwritten text using a large vocabulary. We describe its three main components: preprocessing, feature extraction and recognition. In the preprocessing phase, the handwritten text is first segmented into lines. Each line of text is then normalized with respect to skew, slant, vertical position and width. After these steps, the text lines are segmented into single words. For this purpose, the distances between connected components are measured and, using a threshold, divided into distances within a word and distances between different words. A line of text is segmented at the positions where the distance exceeds the chosen threshold. From each image representing a single word, a sequence of features is extracted. These features are the input to a recognition procedure based on hidden Markov models. To investigate the stability of the segmentation algorithm, the threshold that separates intra-word from inter-word distances is varied. If the threshold is small, many errors are caused by over-segmentation, while large thresholds lead to under-segmentation errors. The best segmentation performance is 95.56% correctly segmented words, tested on 541 text lines containing 3899 words. At this segmentation rate of 95.56%, a recognition rate of 73.45% is achieved on the word level.
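
The threshold-based word segmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the distance between connected components is measured, so the sketch assumes the horizontal gap between bounding boxes. The names `Box` and `segment_words`, and the sample coordinates, are hypothetical.

```python
from typing import List, Tuple

# Assumption: each connected component is reduced to the horizontal extent
# (left, right) of its bounding box, in pixels.
Box = Tuple[int, int]


def segment_words(boxes: List[Box], threshold: float) -> List[List[Box]]:
    """Group connected components into words by thresholding horizontal gaps.

    A gap larger than `threshold` is treated as an inter-word distance and
    starts a new word; any other gap is treated as an intra-word distance.
    """
    words: List[List[Box]] = []
    right_edge = None  # rightmost pixel column covered by the current word
    for box in sorted(boxes):
        left, right = box
        if right_edge is None or left - right_edge > threshold:
            words.append([box])  # first component, or gap above threshold
            right_edge = right
        else:
            words[-1].append(box)  # gap within a word
            right_edge = max(right_edge, right)
    return words


# Hypothetical text line with gaps of 2, 15, 2 and 20 pixels.
line = [(0, 10), (12, 20), (35, 50), (52, 60), (80, 95)]
for t in (1, 8, 30):
    print(f"threshold={t}: {len(segment_words(line, t))} word(s)")
```

On this toy line, a threshold of 1 splits every component into its own word (over-segmentation), while a threshold of 30 merges the whole line into a single word (under-segmentation), which mirrors the two error types named in the abstract.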
