Fast optical character recognition through glyph hashing for document conversion

This paper proposes a glyph hashing approach to optical character recognition, with applications in document conversion. The viability and efficiency of the approach are tested through its implementation in a print driver on 68,987 PDF documents containing 1.15 billion characters. Results indicate that (a) a hash table with 3.2 million hashes is sufficient to represent all characters from these documents, and (b) 480 fonts are sufficient to cover over 90% of these documents. Glyph recognition experiments indicate that 80% of unique character glyphs and over 96% of all characters from unseen documents can be found in a hash table built using all 68,987 documents. The hashing approach recognizes not only the character code but also the size, style (bold, italic, etc.), and font name. We found that the approach scales to hundreds of fonts and thousands of characters per font. It is also fast, recognizing over 100,000 characters per second. Owing to this speed, such a hashing approach can complement any existing OCR system by acting as a pre-filter, producing a 4- to 5-fold speedup during document conversion.
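The pre-filter idea can be illustrated with a minimal sketch. It is not the implementation evaluated here: the abstract does not specify the hash function or the table layout, so the use of SHA-1 over the raw glyph bitmap, and the names GlyphInfo, GlyphTable, glyph_hash, and recognize, are all assumptions made for illustration. A glyph whose hash is already in the table is resolved by a single lookup, and only misses are passed to a slower OCR fallback.

```python
import hashlib
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class GlyphInfo(NamedTuple):
    """Attributes recovered on a table hit (hypothetical record layout)."""
    char_code: str
    font_name: str
    size_pt: float
    style: str


def glyph_hash(bitmap: bytes, width: int, height: int) -> str:
    """Hash a rendered glyph. SHA-1 over dimensions + pixels is an assumption."""
    h = hashlib.sha1()
    # Include the dimensions so equal byte strings with different shapes differ.
    h.update(width.to_bytes(2, "big"))
    h.update(height.to_bytes(2, "big"))
    h.update(bitmap)
    return h.hexdigest()


class GlyphTable:
    """Maps glyph hashes to character code, font name, size, and style."""

    def __init__(self) -> None:
        self._table: Dict[str, GlyphInfo] = {}

    def add(self, bitmap: bytes, width: int, height: int, info: GlyphInfo) -> None:
        self._table[glyph_hash(bitmap, width, height)] = info

    def lookup(self, bitmap: bytes, width: int, height: int) -> Optional[GlyphInfo]:
        return self._table.get(glyph_hash(bitmap, width, height))


def recognize(
    glyphs: List[Tuple[bytes, int, int]],
    table: GlyphTable,
    fallback_ocr: Callable[[bytes, int, int], object],
) -> Tuple[list, int]:
    """Use the hash table as a pre-filter; call fallback_ocr only on misses.

    Returns the per-glyph results and the number of table hits.
    """
    results = []
    hits = 0
    for bitmap, width, height in glyphs:
        info = table.lookup(bitmap, width, height)
        if info is not None:
            hits += 1
            results.append(info)
        else:
            results.append(fallback_ocr(bitmap, width, height))
    return results, hits
```

In this sketch the table would be populated from documents whose character codes and fonts are already known, for example by calling table.add(bitmap, width, height, GlyphInfo("A", "ExampleSans", 12.0, "regular")) for each rendered glyph. An exact hash matches only bit-identical bitmaps, so the approach suits glyphs rendered digitally, as in a print driver, and not noisy scanned images.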