Statistical Inference, Occam's Razor, and Statistical Mechanics on the Space of Probability Distributions

The task of parametric model selection is cast in terms of a statistical mechanics on the space of probability distributions. Using the techniques of low-temperature expansions, I arrive at a systematic series for the Bayesian posterior probability of a model family that significantly extends known results in the literature. In particular, I obtain a precise understanding of how Occam's razor, the principle that simpler models should be preferred until the data justify more complex ones, is automatically embodied by probability theory. These results require a measure on the space of model parameters, and I derive and discuss an interpretation of Jeffreys' prior distribution as a uniform prior over the distributions indexed by a family. Finally, I derive a theoretical index of the complexity of a parametric family relative to some true distribution, which I call the razor of the model. The form of the razor immediately suggests several interesting questions in the theory of learning that can be studied using the techniques of statistical mechanics.
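
To make the automatic Occam's razor concrete, here is a minimal numerical sketch, not taken from the paper itself, comparing two toy model families for coin-flip data. It uses the one fact the abstract relies on, that Jeffreys' prior is the natural uniform measure over a family: for a Bernoulli parameter p, Jeffreys' prior is the Beta(1/2, 1/2) distribution, so the evidence integral has a closed form. The model labels M0 and M1 and the function names are illustrative assumptions, not notation from the paper.

    # Toy Bayesian model comparison under Jeffreys' prior (illustrative sketch).
    #   M0: a fair coin, with no free parameters.
    #   M1: Bernoulli(p) with Jeffreys' prior, Beta(1/2, 1/2).
    from math import log
    from scipy.special import betaln

    def log_evidence_fair(n):
        # M0 has no parameters: every length-n outcome sequence
        # has probability (1/2)^n.
        return -n * log(2.0)

    def log_evidence_jeffreys(k, n):
        # Evidence for M1 given k heads in n flips:
        #   integral of p^k (1-p)^(n-k) Beta(p; 1/2, 1/2) dp
        # which evaluates to B(k + 1/2, n - k + 1/2) / B(1/2, 1/2).
        return betaln(k + 0.5, n - k + 0.5) - betaln(0.5, 0.5)

    for k, n in [(52, 100), (70, 100)]:
        log_bf = log_evidence_jeffreys(k, n) - log_evidence_fair(n)
        print(f"k={k}/{n}: log Bayes factor (M1 vs M0) = {log_bf:+.2f}")

For 52 heads in 100 flips the log Bayes factor is negative: the simpler fair-coin family is preferred even though the parametric family fits the data slightly better, because the marginal likelihood charges M1 for the volume of distributions it indexes. For 70 heads the data justify the extra parameter and M1 wins. This is the mechanism the abstract describes, realized exactly in a case where no low-temperature expansion is needed.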
