Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization

A general theoretical framework for Monte Carlo averaging methods for improving regression estimates is presented, with applications to neural network classification and time series prediction. Given a population of regression estimators, it is shown how to construct a hybrid estimator that is, in the mean squared error (MSE) sense, as good as or better than any estimator in the population. It is argued that the ensemble method presented has several desirable properties: it efficiently uses all the regressors of a population, so none need be discarded; it efficiently uses all the available data for training without over-fitting; it inherently performs regularization by smoothing in function space, which helps to avoid over-fitting; it exploits local minima to construct improved estimates, whereas other regression algorithms are hindered by them; it is ideally suited to parallel computation; and it leads to a very useful and natural measure of the number of distinct estimators in a population. The optimal parameters of the ensemble estimator are given in closed form. It is shown that this result derives from the notion of convexity and can be applied to a wide variety of optimization criteria, including Mean Square Error, a general class of $L_p$-norm cost functions, Maximum Likelihood Estimation, Maximum Entropy, Maximum Mutual Information, the Kullback-Leibler Information (Cross Entropy), Penalized Maximum Likelihood Estimation, and Smoothing Splines. The connection to Bayesian inference is discussed. Experimental results on the NIST OCR database, the Turk and Pentland human face database, and sunspot time series prediction are presented that demonstrate that the ensemble method dramatically improves regression performance on real-world classification tasks.
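As a concrete illustration of the closed-form result in the MSE case: writing e_i(x) = f_i(x) - f(x) for the error of the i-th regressor and C_ij = E[e_i e_j] for the error correlation matrix, the weights alpha_i of the convex combination sum_i alpha_i f_i that minimize MSE subject to sum_i alpha_i = 1 are alpha_i = sum_j (C^{-1})_ij / sum_{k,l} (C^{-1})_{kl}, i.e. alpha = C^{-1}1 / (1^T C^{-1} 1). The Python sketch below is illustrative only; it is not code from the thesis, and the function name, the validation-set estimation of C, and the small ridge stabilizer are all assumptions made here for a runnable example.

    import numpy as np

    def ensemble_weights(residuals, ridge=1e-8):
        """MSE-optimal averaging weights estimated from held-out residuals.

        residuals: array of shape (n_samples, n_estimators), where column i
        holds f_i(x) - y for the i-th regressor on a validation set.
        Returns weights alpha with alpha.sum() == 1.
        """
        n = residuals.shape[0]
        # Sample estimate of the error correlation matrix C_ij = E[e_i e_j].
        C = residuals.T @ residuals / n
        # A small ridge term (an assumption, not from the thesis) guards
        # against a near-singular C when two estimators are nearly identical.
        C = C + ridge * np.eye(C.shape[0])
        ones = np.ones(C.shape[0])
        w = np.linalg.solve(C, ones)   # C^{-1} 1
        return w / w.sum()             # alpha = C^{-1} 1 / (1^T C^{-1} 1)

    # Hypothetical usage: each column of preds_* is one trained regressor's output.
    # alpha = ensemble_weights(preds_val - y_val[:, None])
    # y_hat = preds_test @ alpha

Note that the constraint is only that the weights sum to one, so individual alpha_i may be negative; with equal weights alpha_i = 1/N the combination reduces to simple averaging, which by the convexity argument above can never have worse MSE than the population's average MSE.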
