Consider a system that takes the image of a check and returns the check amount. This system locates the numerical amount, recognizes digits or other symbols, and parses the check amount. Accuracy should remain high despite countless variations in check layout, writing style, or amount grammar.

From an engineering perspective, one must design components for locating the amount, segmenting characters, recognizing digits, and parsing the amount text. Yet it is very difficult to locate the amount without identifying that it is composed of characters that mostly resemble digits and form a meaningful check amount (not a date or a routing number). Purely sequential approaches do not work. Components must interact, form hypotheses, and backtrack erroneous decisions. This orchestration is difficult to design and costly to maintain.

From a statistical perspective, one seeks to estimate and compare the posterior probabilities P(Y|X), where the variable X represents a check image and the variable Y represents a check amount. One could define a suitable parametric model pθ(y|x), gather data pairs (xi, yi), and maximize the likelihood ∑i log pθ(yi|xi). Such a direct approach leads to problems of impractical size. It is therefore common to manually annotate some pairs (xi, yi) with detailed information such as isolated character images T, character codes C, or sequences S of character codes. One can then model P(C|T) and P(Y|S) and obtain components such as a character recognizer or an amount parser.

The statistical perspective suggests a principled way to orchestrate the interaction of these components: let the global model pθ(y|x) be expressed as a composition of submodels such as pθ(c|t) and pθ(y|s). The submodels are first fit using the detailed data. The resulting parameters are then used as a bias when fitting the global model pθ(y|x) on the initial data pairs (xi, yi). This bias can be viewed as a capacity control tool for structural risk minimization (Vapnik, 1982).
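The two-stage scheme just described can be sketched in code. The sketch below is a hypothetical illustration, not the source's method: it collapses the composition into a single parametric family (multinomial logistic regression) to stay short, and all names and toy data are assumptions. A sub-model p(c|t) is fit by maximum likelihood on "detailed" labeled data; its parameters then serve both as initialization and as an L2 bias when fitting the global model p(y|x) on the end-to-end pairs (xi, yi).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def add_bias(Z):
    # Append a constant feature so the linear model has an intercept.
    return np.hstack([Z, np.ones((len(Z), 1))])

def fit_by_gradient_ascent(X, Y, W0, lam=0.0, steps=500, lr=0.1):
    """Maximize sum_i log p(y_i|x_i) - (lam/2) ||W - W0||^2
    for a multinomial logistic model p(y|x) = softmax(x W)."""
    W = W0.copy()
    onehot = np.eye(W.shape[1])[Y]
    for _ in range(steps):
        P = softmax(X @ W)
        grad = X.T @ (onehot - P) / len(X) - lam * (W - W0)
        W += lr * grad
    return W

# "Detailed" data: toy isolated-character features T with character
# codes C (three well-separated classes standing in for digit classes).
centers = np.array([[3.0, 0.0], [0.0, 3.0], [-3.0, -3.0]])
C = np.repeat(np.arange(3), 40)
T = centers[C] + rng.normal(size=(120, 2))

# Stage 1: fit the sub-model p(c|t) on the detailed data.
W_sub = fit_by_gradient_ascent(add_bias(T), C, np.zeros((3, 3)))

# Stage 2: fit the global model p(y|x) on end-to-end pairs (x_i, y_i),
# initialized at W_sub and pulled toward it by the L2 bias
# (the capacity-control role described above).
X, Y = T + rng.normal(scale=0.3, size=T.shape), C
W_glob = fit_by_gradient_ascent(add_bias(X), Y, W_sub, lam=1.0)

accuracy = (softmax(add_bias(X) @ W_glob).argmax(axis=1) == Y).mean()
```

With a large lam the global fit stays close to the pretrained sub-model; with lam = 0 it reduces to ordinary maximum likelihood on the end-to-end pairs alone.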
Model composition works nicely with generative models where one seeks to estimate the joint density P (X,Y ) instead of the posterior P (Y |X). For instance, Hidden Markov Models (HMM) for speech recognition (Rabiner, 1989) use the decomposition
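For reference, the textbook form of this generative factorization in speech recognition (stated here in its standard form, not quoted from the source) splits the joint density into a language model and an acoustic model:

P(X, Y) = P(X | Y) P(Y),

where P(Y) is a prior over word sequences (the language model) and P(X | Y) describes how a word sequence produces acoustic observations (the acoustic model).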
[1] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 1989.
[2] Fernando Pereira et al. Weighted rational transductions and their application to human language processing. HLT, 1994.
[3] Yoshua Bengio et al. Global training of document processing systems using graph transformer networks. Proc. IEEE CVPR, 1997.
[4] Yoshua Bengio et al. Gradient-based learning applied to document recognition. Proc. IEEE, 1998.
[5] Andrew McCallum et al. Conditional random fields: probabilistic models for segmenting and labeling sequence data. ICML, 2001.
[6] Léon Bottou et al. On-line learning for very large data sets. 2005.
[7] V. Vapnik. Estimation of Dependences Based on Empirical Data. 2006.