Graph transformer networks for image recognition

Consider a system that takes the image of a check and returns the check amount. This system locates the numerical amount, recognizes digits or other symbols, and parses the check amount. Accuracy should remain high despite countless variations in check layout, writing style, or amount grammar.

From an engineering perspective, one must design components for locating the amount, segmenting characters, recognizing digits, and parsing the amount text. Yet it is very difficult to locate the amount without identifying that it is composed of characters that mostly resemble digits and form a meaningful check amount (not a date or a routing number). Purely sequential approaches do not work: components must interact, form hypotheses, and backtrack erroneous decisions. This orchestration is difficult to design and costly to maintain.

From a statistical perspective, one seeks to estimate and compare the posterior probabilities P(Y|X), where the variable X represents a check image and the variable Y represents a check amount. Let us define a suitable parametric model pθ(y|x), gather data pairs (xi, yi), and maximize the likelihood ∑i log pθ(yi|xi). Such a direct approach leads to problems of impractical size. It is therefore common to manually annotate some pairs (xi, yi) with detailed information such as isolated character images T, character codes C, or sequences S of character codes. One can then model P(C|T) and P(Y|S) and obtain components such as a character recognizer or an amount parser.

The statistical perspective suggests a principled way to orchestrate the interaction of these components: let the global model pθ(y|x) be expressed as a composition of submodels such as pθ(c|t) and pθ(y|s). The submodels are first fit using the detailed data. The resulting parameters are used as a bias when fitting the global model pθ(y|x) using the initial data pairs (xi, yi). This bias can be viewed as a capacity control tool for structural risk minimization (Vapnik, 1982).

Model composition works nicely with generative models, where one seeks to estimate the joint density P(X,Y) instead of the posterior P(Y|X). For instance, Hidden Markov Models (HMM) for speech recognition (Rabiner, 1989) use the decomposition
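
The composition-and-pretraining idea above can be made concrete with a minimal sketch, written here in PyTorch. It is not the paper's implementation: the class names, layer sizes, amount vocabulary, and in particular the assumption that the image is already cut into a fixed number of candidate character patches are all illustrative simplifications. The point is only to show the two submodels pθ(c|t) and pθ(y|s) being fit on detailed annotations and their parameters then serving as the starting point (the "bias") when the composed model pθ(y|x) is fit on raw (x, y) pairs by maximizing the log-likelihood.

    # Minimal sketch (assumptions: fixed-slot segmentation, toy class counts).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CODES = 13      # assumed: digits 0-9 plus a few punctuation symbols
    NUM_SLOTS = 8       # assumed: candidate character patches per amount field
    NUM_AMOUNTS = 100   # assumed: toy set of amount classes
    PATCH = 16 * 16     # assumed: flattened character patch size

    class CharRecognizer(nn.Module):
        """Models p_theta(c|t): character code given an isolated character image."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(PATCH, 128), nn.ReLU(),
                                     nn.Linear(128, NUM_CODES))
        def forward(self, t):                 # t: (batch, PATCH)
            return self.net(t)                # unnormalized log-probabilities

    class AmountParser(nn.Module):
        """Models p_theta(y|s): amount given a sequence of character codes."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(NUM_SLOTS * NUM_CODES, 128), nn.ReLU(),
                                     nn.Linear(128, NUM_AMOUNTS))
        def forward(self, s):                 # s: (batch, NUM_SLOTS, NUM_CODES)
            return self.net(s.flatten(1))

    class GlobalModel(nn.Module):
        """Composes the submodels into p_theta(y|x), assuming the image is already
        cut into NUM_SLOTS candidate character patches (a strong simplification)."""
        def __init__(self, recognizer, parser):
            super().__init__()
            self.recognizer, self.parser = recognizer, parser
        def forward(self, x):                 # x: (batch, NUM_SLOTS, PATCH)
            b = x.shape[0]
            char_logits = self.recognizer(x.reshape(b * NUM_SLOTS, PATCH))
            soft_codes = F.softmax(char_logits, dim=-1).reshape(b, NUM_SLOTS, NUM_CODES)
            return self.parser(soft_codes)

    def fit(model, inputs, targets, steps=100, lr=1e-3):
        """Maximize the likelihood, i.e. minimize the negative log-likelihood."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            opt.step()

    if __name__ == "__main__":
        # Synthetic stand-ins for the detailed data (T, C), (S, Y) and raw pairs (x, y).
        T, C = torch.randn(256, PATCH), torch.randint(0, NUM_CODES, (256,))
        S, Y = torch.randn(256, NUM_SLOTS, NUM_CODES), torch.randint(0, NUM_AMOUNTS, (256,))
        X, Yx = torch.randn(64, NUM_SLOTS, PATCH), torch.randint(0, NUM_AMOUNTS, (64,))

        recognizer, parser = CharRecognizer(), AmountParser()
        fit(recognizer, T, C)                 # fit p_theta(c|t) on detailed data
        fit(parser, S, Y)                     # fit p_theta(y|s) on detailed data

        # The pretrained parameters act as the bias: the composed model starts
        # from them and is then fit end to end on the original (x, y) pairs.
        fit(GlobalModel(recognizer, parser), X, Yx)

In practice the amount cannot be pre-cut into fixed character slots; the location and segmentation hypotheses must themselves be produced and revised by the model, which is exactly why the components need to interact rather than run in a fixed sequence.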