Comparing adaptive and non-adaptive connection pruning with pure early stopping

Neural network pruning methods on the level of individual network parameters (e.g. connection weights) can improve generalization, as is shown in this empirical study. However, an open problem in the pruning methods known today (OBD, OBS, autoprune, epsiprune) is the selection of the number of parameters to be removed in each pruning step (pruning strength). This work presents a pruning method, lprune, that automatically adapts the pruning strength to the evolution of the weights and the loss of generalization during training. The method requires no algorithm parameter adjustment by the user. Results of statistical significance tests comparing autoprune, lprune, and static networks with early stopping are given, based on extensive experimentation with 14 different problems. The results indicate that training with pruning is often significantly better and rarely significantly worse than training with early stopping without pruning. Furthermore, lprune is often superior to autoprune (which is superior to OBD) on diagnosis tasks unless severe pruning early in the training process is required.

1 Pruning and Generalization

The principal idea of pruning is to reduce the number of free parameters in the network by removing dispensable ones. Pruning methods usually either remove complete input or hidden nodes along with all their associated parameters, or remove individual connections, each of which carries one free parameter (the weight). This latter approach is very fine-grained and makes pruning particularly powerful. If applied properly, pruning often reduces overfitting and improves generalization. At the same time it produces a smaller network. Interestingly, most papers on pruning algorithms do show empirically that smaller networks can be obtained without loss of generalization, but do not show that generalization will often be improved compared to reasonable static-network training methods. The present paper fills that gap.

1.1 Related Work: Some Known Pruning Methods

The key to pruning is a method to calculate the approximate importance of each parameter. Several such methods have been suggested. The simplest one, with obvious flaws [3], is to assume the importance of a weight to be proportional to its magnitude. More sophisticated approaches are the well-known optimal brain damage (OBD) and optimal brain surgeon (OBS) methods. OBD [1] uses an approximation to the second derivative of the error with respect to each weight to determine the saliency of the removal of that weight. Low saliency means low importance of a weight. OBS [5] avoids the drawbacks of the approximation by computing the second derivatives (almost) exactly, but is computationally very expensive. Both methods have the disadvantage of requiring training to the error minimum before pruning may occur. For many problems, this introduces massive overfitting which often cannot be repaired by subsequent pruning. The autoprune method [3] avoids this problem. Its weight importance coefficients are defined by a test statistic T for the assumption that a weight becomes zero during the training process:

T(w_i) = \log \left( \frac{\left| \sum_p \left( w_i - \eta\, (\partial E / \partial w_i)_p \right) \right|}{\sqrt{\sum_p \eta^2 \left( (\partial E / \partial w_i)_p - \overline{\partial E / \partial w_i} \right)^2}} \right)

In contrast to OBD and OBS, this measure does not assume an error minimum has been reached; it can be computed at any time during training. In the above formula, the sums run over all examples p of the training set, η is the learning rate, and the overline denotes the arithmetic mean over the examples. A large value of T indicates high importance of the connection with weight w_i.
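To make the statistic concrete, here is a minimal NumPy sketch (not from the paper) of how T could be evaluated for a single weight from its per-example gradients. The function name, the sign convention of the gradient step, and the guard against a zero denominator are assumptions for illustration.

```python
import numpy as np

def autoprune_importance(w, grads, eta, eps=1e-12):
    """Sketch of the autoprune importance statistic T for one weight.

    w     : current value of the weight w_i
    grads : 1-D array of per-example gradients (dE/dw_i)_p over the training set
    eta   : learning rate
    eps   : small constant guarding against a zero denominator (an assumption)

    Returns log(|sum_p (w - eta*g_p)| / sqrt(sum_p (eta*g_p - eta*mean(g))^2)).
    A large T suggests the weight is unlikely to become zero, i.e. it is important.
    """
    grads = np.asarray(grads, dtype=float)
    numerator = np.abs(np.sum(w - eta * grads))
    deviation = eta * (grads - grads.mean())
    denominator = np.sqrt(np.sum(deviation ** 2)) + eps
    return np.log(numerator / denominator + eps)

# Example: evaluate T for a weight w_i = 0.5 with 200 random per-example gradients.
rng = np.random.default_rng(0)
t_value = autoprune_importance(0.5, rng.normal(0.0, 1.0, 200), eta=0.05)
print(round(t_value, 2))
```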
Connections with small T can be pruned. The authors of [3] have convincingly shown autoprune to be superior to OBD. Note that many more pruning methods than those discussed here have been proposed in the literature. In particular, Bayesian methods can unify the notions of regularization and pruning [11].

1.2 An Open Problem: How Much To Prune?

Given the importance T of each weight at any time during training, two questions remain to be answered:
1. When should we prune?
2. How many connections should be removed in the next pruning step?
The first question is simple to answer: for OBD and OBS, pruning occurs when minimum training set error has been reached; for autoprune, pruning occurs when overfitting begins (here: when the validation set error has increased twice during training; see below). The second question, however, has not yet been answered satisfactorily. The authors of OBD suggest deleting "some" parameters. The authors of autoprune at least suggest a concrete pruning schedule: remove 35% of all parameters in the first pruning step and 10% in each following step; a sketch of this fixed schedule is given at the start of the next section. Such rules of thumb, however, are not satisfactory, because they obviously cannot always be optimal. The following section presents a pruning method, called lprune, that is based on autoprune and tries to solve this problem. It computes the pruning schedule dynamically during training, adapting to the evolution of the weights and to the amount of overfitting observed.

2 Adaptive Pruning Schedules: The lprune Method
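As a point of reference for the adaptive schedule developed in this section, the following sketch (not from the paper) spells out the fixed autoprune schedule and pruning trigger described above: prune once the validation error has increased twice, removing 35% of the connections in the first pruning step and 10% in each later step. The function names, the interpretation of the percentages as fractions of the connections still unpruned, and the exact definition of a validation-error increase are assumptions for illustration.

```python
import numpy as np

def fixed_pruning_fraction(step):
    """Fixed autoprune rule of thumb: 35% in the first pruning step,
    10% in each following step (applied here to the connections that
    are still unpruned, which is an assumption)."""
    return 0.35 if step == 1 else 0.10

def should_prune(val_errors, increases_required=2):
    """Trigger pruning once overfitting begins, approximated here as the
    validation error having risen `increases_required` times so far."""
    increases = sum(1 for a, b in zip(val_errors, val_errors[1:]) if b > a)
    return increases >= increases_required

def prune_once(importance, alive, step):
    """Mark the least important of the still-alive connections as pruned.

    importance : array of T values, one per connection
    alive      : boolean mask of connections not yet pruned
    step       : 1-based index of the current pruning step
    Returns an updated copy of the mask.
    """
    alive = alive.copy()
    n_remove = int(round(fixed_pruning_fraction(step) * alive.sum()))
    candidates = np.flatnonzero(alive)
    weakest = candidates[np.argsort(importance[candidates])[:n_remove]]
    alive[weakest] = False
    return alive
```

lprune, described next, keeps the overfitting-based trigger but replaces the hard-coded 35%/10% fractions with a pruning strength computed during training from the evolution of the weights and the observed loss of generalization.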