### 92¢ /MFlops/s, Ultra-Large-Scale Neural-Network Training on a PIII Cluster

Artificial neural networks with millions of adjustable parameters and a similar number of training examples are a potential solution for difficult, large-scale pattern recognition problems in areas such as speech and face recognition, classification of large volumes of web data, and finance. The bottleneck is that neural network training involves iterative gradient descent and is extremely computationally intensive. In this paper we present a technique for distributed training of Ultra Large Scale Neural Networks 1 (ULSNN) on Bunyip, a Linux-based cluster of 196 Pentium III processors. To illustrate ULSNN training we describe an experiment in which a neural network with 1.73 million adjustable parameters was trained to recognize machine-printed Japanese characters from a database containing 9 million training patterns. The training runs with a average performance of 163.3 GFlops/s (single precision). With a machine cost of $150,913, this yields a price/performance ratio of 92.4¢ /MFlops/s (single precision). For comparison purposes, training using double precision and the ATLAS DGEMM produces a sustained performance of 70 MFlops/s or$2.16 / MFlop/s (double precision).

[1]  V. Strassen Gaussian elimination is not optimal , 1969 .

[2]  B. Greer,et al.  High Performance Software on Intel Pentium Pro Processors or Micro-Ops to TeraFLOPS , 1997, ACM/IEEE SC 1997 Conference (SC'97).

[3]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[4]  James Demmel,et al.  Using PHiPAC to speed error back-propagation learning , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Comput..

[6]  Terrence L. Fine Feedforward Neural Network Methodology , 1999, Information Science and Statistics.

[8]  Mithuna Thottethodi,et al.  Tuning Strassen's Matrix Multiplication for Memory Efficiency , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[9]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.