Accelerated Learning in Back-propagation Nets

Two of the most serious problems with back-propagation (bp) (Werbos, 1974; Parker, 1985; Rumelhart et al., 1986; Almeida, 1987) are insufficient speed and the danger of getting stuck in local minima. We offer an approach to cope with both of these problems: instead of using bp to find zero-points of the gradient of the error-surface, we look for zero-points of the error-surface itself. This can be done with less computational effort than second-order methods require. Experimental results indicate that in cases where only a small fraction of units is active simultaneously (sparse coding), this method can be applied successfully. Furthermore, it can be significantly faster than conventional bp.

1 The Method

Numerous gradient descent methods for adjusting weights in neural nets are described in the literature (see e.g. the articles by Parker, Dahl, and Watrous in IEEE 1st Int. Conf. on Neural Networks, Vol. 2). Common to all of these methods is that they try to find a zero-point of the gradient of the error-surface, hoping that the corresponding local minimum is `global enough' to be acceptable. The surprising fact, which is only poorly understood so far, is that this works in some cases.

What we are really interested in is not so much the local minima but the zero-points of the error function $E$ (or points close to zero). $E$ is the sum of all errors $E_p$, where $E_p$ is the error caused by some particular pattern $p$. Back-propagation gives us the gradient $\frac{\partial E_p}{\partial \vec{w}}$, where $\vec{w} = (w_1, \ldots, w_n)$ is the complete weight vector of the system. In order to implement gradient descent we have to change the weights according to $\Delta \vec{w} = -\eta \frac{\partial E_p}{\partial \vec{w}}$, where $\eta$ is a proportionality factor, the learning rate. What we need is a good choice for $\eta$. Our approach is to compute $\eta$ dynamically during each pattern presentation. $\eta$ is chosen such that the changed weight vector $\hat{\vec{w}} = \vec{w} + \Delta \vec{w}$ points to the intersection of the weight hyper-plane (in $(n+1)$-dimensional weight-error space) and the line defined by the current error and the current gradient. The basic assumption is that the error function can be locally approximated by its tangential hyper-plane. Thus we use the gradient to do linear extrapolation, in order to gain a new weight vector whose corresponding $E_p$ is …
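To make the extrapolation step concrete, the following short derivation spells out the learning rate that the intersection condition implies under the stated linear-approximation assumption; the explicit formula is a reconstruction from the description above, not a quotation. The tangent hyper-plane at the current point of weight-error space gives

$$E_p(\vec{w} + \Delta \vec{w}) \;\approx\; E_p(\vec{w}) + \frac{\partial E_p}{\partial \vec{w}} \cdot \Delta \vec{w}.$$

Requiring the right-hand side to vanish while moving along the negative gradient, $\Delta \vec{w} = -\eta\, \frac{\partial E_p}{\partial \vec{w}}$, yields

$$0 = E_p - \eta \left\| \frac{\partial E_p}{\partial \vec{w}} \right\|^2, \qquad \text{hence} \qquad \eta = \frac{E_p}{\left\| \partial E_p / \partial \vec{w} \right\|^2}.$$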
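A minimal sketch of one such per-pattern update, assuming a generic scalar error function and its gradient are supplied by the user; the names error_fn, grad_fn, and accelerated_bp_step are illustrative rather than taken from the text, and the eps term is added only to guard against a vanishing gradient.

import numpy as np

def accelerated_bp_step(w, pattern, error_fn, grad_fn, eps=1e-12):
    # One weight update for a single pattern p, with a dynamically
    # chosen learning rate (linear extrapolation towards E_p = 0).
    E_p = error_fn(w, pattern)        # scalar error for this pattern
    g = grad_fn(w, pattern)           # gradient dE_p/dw, same shape as w
    # Choose eta so that the tangent-plane approximation of E_p
    # along -g reaches zero: eta = E_p / ||g||^2.
    eta = E_p / (np.dot(g, g) + eps)
    return w - eta * g                # hat(w) = w + Delta w

# Per-pattern (online) use, given some training set and error/gradient routines:
# for pattern in training_patterns:
#     w = accelerated_bp_step(w, pattern, error_fn, grad_fn)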