Learning Coefficients of Layered Models When the True Distribution Mismatches the Singularities

Hierarchical learning machines such as layered neural networks have singularities in their parameter spaces. At a singularity, the Fisher information matrix degenerates, so the conventional learning theory of regular statistical models does not hold. Recently, it was proved that if the parameter of the true distribution is contained in the singularities of the learning machine, the Bayes generalization error is asymptotically equal to λ/n, where 2λ is smaller than the dimension of the parameter and n is the number of training samples. However, the constant λ strongly depends on the local geometric structure of the singularities; hence, the generalization error has not been clarified when the true distribution is almost, but not completely, contained in the singularities. In this article, in order to analyze such cases, we study the Bayes generalization error under the condition that the Kullback-Leibler distance of the true distribution from the distribution represented by the singularities is proportional to 1/n, and we show two results. First, if the dimension of the parameter from inputs to hidden units is not larger than three, then there exists a region of true parameters for which the generalization error is larger than that of the corresponding regular model. Second, if the dimension from inputs to hidden units is larger than three, then for an arbitrary true distribution the generalization error is smaller than that of the corresponding regular model.
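In symbols, the setting described above can be sketched as follows (a minimal rendering; the notation λ for the learning coefficient, d for the parameter dimension, q for the true distribution, p(·|w) for the model, and w₀ for a singular parameter is assumed here and is not fixed by the abstract itself):

  % Bayes generalization error when the true parameter lies on the singularities:
  G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right), \qquad 2\lambda \le d.

  % Mismatch condition studied in this article: the true distribution is close to,
  % but not exactly on, the singular set, at the 1/n scale:
  K\bigl(q \,\|\, p(\cdot \mid w_0)\bigr) \;\propto\; \frac{1}{n}, \qquad w_0 \in \{\text{singularities of the model}\}.

The two results then compare G(n) under this condition with the regular-model rate d/(2n), with the sign of the comparison depending on whether the input-to-hidden parameter dimension exceeds three.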
