Synthetic data for Arabic OCR system development

A system is presented for the automatic generation of synthetic databases for developing or evaluating Arabic word or text recognition systems (Arabic OCR). The proposed system works without scanning any printed paper. First, Arabic text is typeset using a standard typesetting system. Second, a noise-free bitmap of the document and the corresponding ground truth (GT) are generated automatically. Finally, an image distortion can be superimposed on the character or word image to simulate the real-world noise expected in the intended application. All necessary modules are presented together with some examples. Special problems caused by specific features of Arabic, such as printing from right to left, many diacritical points, variation in the height of characters, and changes in position relative to the writing line, are discussed. The synthetic data set was used to train and test a recognition system for Arabic printed words based on hidden Markov models (HMMs), which was originally developed for German cursive script. Recognition results with different synthetic data sets are presented.
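
The final step of the pipeline, superimposing a distortion on a noise-free bitmap while keeping the ground truth attached, can be illustrated with a minimal sketch. The abstract does not specify the distortion model, so the independent pixel-flip noise used here is an assumption chosen for simplicity, not the system's actual method. The function names `add_flip_noise` and `make_sample`, the toy bitmap, and the transliterated label are all hypothetical.

```python
import random


def add_flip_noise(bitmap, flip_prob, seed=None):
    """Return a copy of a binary bitmap (rows of 0/1 values) in which each
    pixel is flipped independently with probability flip_prob.

    Assumption: this simple flip noise stands in for whatever distortion
    model the intended application calls for.
    """
    rng = random.Random(seed)
    return [
        [1 - px if rng.random() < flip_prob else px for px in row]
        for row in bitmap
    ]


def make_sample(bitmap, ground_truth, flip_prob, seed=None):
    """Pair a distorted image with its unchanged ground-truth label."""
    return {
        "image": add_flip_noise(bitmap, flip_prob, seed),
        "gt": ground_truth,
    }


# Toy stand-in for a rendered, noise-free word image (1 = ink, 0 = background).
clean = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

# Hypothetical label; a real ground truth would hold the typeset Arabic text.
sample = make_sample(clean, "kitab", flip_prob=0.05, seed=0)

for row in sample["image"]:
    print("".join("#" if px else "." for px in row))
print("GT:", sample["gt"])
```

Because the label comes from the typeset source rather than from the image, the ground truth stays exact however strong the distortion is. Varying `flip_prob` gives data sets of graded difficulty from the same clean rendering.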
