
The rate of convergence is also an important theoretical and crucial practical issue. Recent results [62], [63], [131], [132], [133] have shown that, under certain regularity conditions, the approximation error typically improves as $O(1/n)$, where $n$ is the number of hidden units in the network. However, as mentioned in Section II-D.3, some of these results are not applicable to constructive algorithms when a greedy approach is taken, whereas others are applicable to greedy algorithms but under detailed conditions different from those used in practice. For example, in both [62] and [63], the iterative sequence of network estimates is formed from a convex combination of the previous network function $f_{n-1}$ and the new hidden unit function $g_n$,

$$f_n = (1 - \alpha_n) f_{n-1} + \alpha_n g_n, \qquad (7)$$

where $0 \le \alpha_n \le 1$. In algorithms like the cascade-correlation algorithm (Section IV-C), however, the new $f_n$ is formed from a full linear combination of the old and new hidden unit functions; the weights connecting the old hidden units to the output unit are thus not constrained as a group, as they are in (7). Besides, in (7), $\alpha_n$ must be learned together with the new hidden unit $g_n$, while in the cascade-correlation algorithm the parameters of the new hidden unit are learned first and the output-layer weights afterwards. Moreover, the objective function minimized in [62], [63] is different from that of the cascade-correlation algorithm. Modifying these useful theoretical results and applying them to the analysis of different constructive algorithms would therefore be beneficial. Some initial progress has been reported in [119], [120], [121].

As mentioned in Section II-C, one must also decide when to stop the constructive algorithm. This is important for achieving a proper bias-variance trade-off. A good stopping criterion is one based on an estimate of the generalization performance of the network, but computing such an estimate accurately and efficiently for neural networks is not easy, and it remains an important research topic. Up to now, a comprehensive performance comparison of different constructive algorithms is …
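To make the contrast concrete, the following sketch (not the authors' code; the toy data, the randomly parameterized sigmoidal hidden units, and the names convex_update and full_refit are illustrative assumptions) adds hidden units one at a time and updates the network estimate in the two ways discussed above: the convex-combination step of (7) with a single mixing weight in [0, 1], and a cascade-correlation-style re-fit of all output weights by linear least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: y = sin(x) + noise (illustrative data only).
x = np.linspace(-3.0, 3.0, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(x.shape)


def hidden_unit(x, w, b):
    """Output of a single sigmoidal hidden unit, g(x) = sigma(w*x + b)."""
    return 1.0 / (1.0 + np.exp(-(w * x + b)))


def convex_update(f_prev, g_new, y):
    """One step of the convex-combination scheme in (7):
    f_n = (1 - alpha) * f_{n-1} + alpha * g_n, with alpha in [0, 1]
    chosen in closed form to minimize the squared error."""
    d = g_new - f_prev
    denom = float(np.dot(d, d))
    alpha = 0.0 if denom == 0.0 else float(np.clip(np.dot(y - f_prev, d) / denom, 0.0, 1.0))
    return (1.0 - alpha) * f_prev + alpha * g_new, alpha


def full_refit(hidden_outputs, y):
    """Cascade-correlation-style output layer: the weights of *all* hidden
    units (plus a bias) are re-estimated jointly by linear least squares,
    so the old units' contributions are not constrained as a group."""
    H = np.column_stack([np.ones_like(y)] + hidden_outputs)
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)
    return H @ beta, beta


f_convex = np.zeros_like(y)   # start from the zero function
units = []                    # outputs of the hidden units added so far

for k in range(1, 4):
    # A new, randomly parameterized hidden unit (a stand-in for a trained one).
    g = hidden_unit(x, w=rng.normal(scale=2.0), b=rng.normal(scale=2.0))
    units.append(g)

    f_convex, alpha = convex_update(f_convex, g, y)
    f_full, _ = full_refit(units, y)

    print(f"n={k}: alpha={alpha:.3f}  "
          f"convex-combination MSE={np.mean((y - f_convex) ** 2):.4f}  "
          f"full-refit MSE={np.mean((y - f_full) ** 2):.4f}")
```

At each step the full least-squares re-fit can never have higher training error than the convex-combination iterate, since the latter is one feasible choice of output weights; this illustrates why convergence results derived for (7) do not carry over directly. In the actual cascade-correlation algorithm the new unit would of course first be trained (for example, to maximize the correlation of its output with the current residual error) rather than drawn at random.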

[1] M. Tummala et al., "Identification of Volterra systems with a polynomial neural network," in Proceedings of ICASSP-92, the 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.

[2] Gábor Lugosi et al., "Nonparametric estimation via empirical risk minimization," IEEE Transactions on Information Theory, 1995.

[3] Andrew R. Barron et al., "Universal approximation bounds for superpositions of a sigmoidal function," IEEE Transactions on Information Theory, 1993.

[4] John Moody et al., "Prediction Risk and Architecture Selection for Neural Networks," 1994.

[5] Jooyoung Park et al., "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 1991.

[6] Russell Reed et al., "Pruning algorithms - a survey," IEEE Transactions on Neural Networks, 1993.

[7] Jooyoung Park et al., "Approximation and Radial-Basis-Function Networks," Neural Computation, 1993.

[8] E. Fiesler et al., "Comparative Bibliography of Ontogenic Neural Networks," 1994.

[9] James D. Keeler et al., "Predicting the Future: Advantages of Semilocal Units," Neural Computation, 1991.

[10] M. Golea et al., "A Convergence Theorem for Sequential Learning in Two-Layer Perceptrons," 1990.

[11] G. Schwarz, "Estimating the Dimension of a Model," 1978.

[12] Ken-ichi Funahashi et al., "On the approximate realization of continuous mappings by neural networks," Neural Networks, 1989.

[13] Brian D. Ripley et al., "Statistical Ideas for Selecting Network Architectures," SNN Symposium on Neural Networks, 1995.

[14] H. Akaike, "A new look at the statistical model identification," 1974.

[15] Garrison W. Cottrell et al., "Topology-modifying neural network algorithms," 1998.

[16] Leslie G. Valiant et al., "A theory of the learnable," Communications of the ACM, 1984.

[17] Blake LeBaron et al., "Evaluating Neural Network Predictors by Bootstrapping," 1994.

[18] James D. Keeler et al., "Layered Neural Networks with Gaussian Hidden Units as Universal Approximations," Neural Computation, 1990.

[19] Michael I. Jordan et al., Advances in Neural Information Processing Systems 30, 1995.

[20] Kurt Hornik et al., "Some new results on neural network approximation," Neural Networks, 1993.

[21] R. A. Silverman et al., Introductory Real Analysis, 1972.

[22] John Moody et al., "Note on generalization, regularization and architecture selection in nonlinear learning systems," in Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop, 1991.

[23] Gail Gong, "Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression," 1986.

[24] Manoel Fernando Tenorio et al., "Self-organizing network for optimum supervised learning," IEEE Transactions on Neural Networks, 1990.

[25] George Cybenko et al., "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, 1989.

[26] S. K. Rogers et al., "A taxonomy of neural network optimality," in Proceedings of the IEEE 1992 National Aerospace and Electronics Conference (NAECON 1992), 1992.

[27] The Institute of Statistical Mathematics, Annals of the Institute of Statistical Mathematics, 1988.

[28] Elie Bienenstock et al., "Neural Networks and the Bias/Variance Dilemma," Neural Computation, 1992.

[29] Babak Hassibi et al., "Second Order Derivatives for Network Pruning: Optimal Brain Surgeon," NIPS, 1992.

[30] Michael C. Mozer et al., "Skeletonization: A Technique for Trimming the Fat from a Network via Relevance Assessment," NIPS, 1988.

[31] A. Barron, "Approximation and Estimation Bounds for Artificial Neural Networks," COLT '91, 1991.

[32] Harry Wechsler et al., From Statistics to Neural Networks: Theory and Pattern Recognition Applications, 1996.

[33] Peter Craven et al., "Smoothing noisy data with spline functions," 1978.

[34] Guillaume Deffuant, "Neural units recruitment algorithm for generation of decision trees," in Proceedings of the 1990 IJCNN International Joint Conference on Neural Networks, 1990.

[35] M. Stone et al., "Cross-Validatory Choice and Assessment of Statistical Predictions," 1976.

[36] C. Jutten et al., "GAL: Networks That Grow When They Learn and Shrink When They Forget," 1991.

[37] Ehud D. Karnin et al., "A simple procedure for pruning back-propagation trained neural networks," IEEE Transactions on Neural Networks, 1990.

[38] Eric B. Baum et al., "A Proposal for More Powerful Learning Algorithms," Neural Computation, 1989.

[39] Kurt Hornik et al., "Approximation capabilities of multilayer feedforward networks," Neural Networks, 1991.

[40] J. Rissanen et al., "Modeling by Shortest Data Description," Automatica, 1978.

[41] Marcus Frean et al., "The Upstart Algorithm: A Method for Constructing and Training Feedforward Neural Networks," Neural Computation, 1990.

[42] Steve Renals, "Radial basis function network for speech pattern classification," 1989.