Learning nongenerative grammatical models for document analysis

We present a general approach for the hierarchical segmentation and labeling of document layout structures. This approach models document layout as a grammar and performs a global search for the optimal parse based on a grammatical cost function. Our contribution is to utilize machine learning to discriminatively select features and set all parameters in the parsing process. Therefore, and unlike many other approaches for layout analysis, ours can easily adapt itself to a variety of document analysis problems. One need only specify the page grammar and provide a set of correctly labeled pages. We apply this technique to two document image analysis tasks: page layout structure extraction and mathematical expression interpretation. Experiments demonstrate that the learned grammars can be used to extract the document structure in 57 files from the UWIII document image database. We also show that the same framework can be used to automatically interpret printed mathematical expressions so as to recreate the original LaTeX

[1]  P. A. Chou,et al.  Recognition of Equations Using a Two-Dimensional Stochastic Context-Free Grammar , 1989, Other Conferences.

[2]  Alan Conway,et al.  Page grammars and page parsing. A syntactic approach to document layout recognition , 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[3]  Richard Zanibbi,et al.  Applying compiler techniques to diagram recognition , 2002, Object recognition supported by user interaction for service robots.

[4]  Daniel X. Le,et al.  Automated labeling in document images , 2000, IS&T/SPIE Electronic Imaging.

[5]  Paul A. Viola,et al.  Rapid object detection using a boosted cascade of simple features , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[6]  Sargur N. Srihari,et al.  Knowledge-based derivation of document logical structure , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[7]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[8]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[9]  Jesse F Hull Recognition of mathematics using a two-dimensional trainable context-free grammar , 1996 .

[10]  Martin Kay,et al.  Algorithm schemata and data structures in syntactic processing , 1986 .

[11]  Philip A. Chou,et al.  Turbo recognition: a statistical approach to layout analysis , 2000, IS&T/SPIE Electronic Imaging.

[12]  Mahesh Viswanathan,et al.  Syntactic Segmentation and Labeling of Digitized Pages from Technical Journals , 1993, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Robert M. Haralick,et al.  CD-ROM document database standard , 1993, Proceedings of 2nd International Conference on Document Analysis and Recognition (ICDAR '93).

[14]  Dan Klein,et al.  A* Parsing: Fast Exact Viterbi Parse Selection , 2003, NAACL.

[15]  Mahesh Viswanathan,et al.  Document recognition: an attribute grammar approach , 1996, Electronic Imaging.

[16]  Azriel Rosenfeld,et al.  Document structure analysis algorithms: a literature survey , 2003, IS&T/SPIE Electronic Imaging.

[17]  Nicholas E. Matsakis Recognition of Handwritten Mathematical Expressions , 1999 .

[18]  Eugene Charniak,et al.  Edge-Based Best-First Chart Parsing , 1998, VLC@COLING/ACL.

[19]  Ben Taskar,et al.  Max-Margin Parsing , 2004, EMNLP.

[20]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[21]  Paul A. Viola,et al.  Efficient geometric algorithms for parsing in two dimensions , 2005, Eighth International Conference on Document Analysis and Recognition (ICDAR'05).

[22]  Paul A. Viola,et al.  Ambiguity and Constraint in Mathematical Expression Recognition , 1998, AAAI/IAAI.

[23]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1995, EuroCOLT.