High-Accuracy Object Recognition with a New Convolutional Net Architecture and Learning Algorithm

Purely supervised convolutional networks yield excellent accuracy on image recognition tasks when labeled data is plentiful [1]. But until now, they have not produced state-of-the-art accuracy on object recognition benchmarks for which few labeled samples per category are available. For example, on the popular Caltech-101 dataset with 30 samples for each of the 101 categories, methods that use hand-designed features, such as SIFT and Geometric Blur combined with a kernel classifier, achieve accuracies of 66.2% [5] and 64.6% [6]. By contrast, a purely supervised convolutional network with standard sigmoid non-linearities yields only 26%. This abstract describes a modified ConvNet architecture with a new unsupervised/supervised training procedure that reaches 67.2% accuracy on Caltech-101.

This work explores several architectural designs and training methods and studies their effect on object recognition accuracy. The convolutional network under consideration takes a 143x143 grayscale image as input. The preprocessing consists of removing the mean and performing a local contrast normalization (dividing each pixel by the standard deviation of its neighbors). The first stage has 64 filters of size 9x9, followed by a subsampling layer with a 5x5 stride and a 10x10 averaging window. The second stage has 256 feature maps, each computed with 16 filters connected to a random subset of the first-layer feature maps. Its subsampling layer has a 4x4 stride and a 6x6 averaging window. Hence, the input to the last layer consists of 256 feature maps of size 4x4 (4096 dimensions). Figure 1 shows the outline of the convolutional net, and Figure 2 shows the best sequence of transformations at each stage of the network. A code sketch of the architecture is given below.

The results are shown in the table. The most surprising result is that simply adding an absolute value after the hyperbolic tangent (tanh) non-linearity practically doubles the recognition rate, from 26% to 58%, with purely supervised training. We conjecture that the advantage of a rectifying non-linearity is that it removes redundant information (the polarity of features) and, at the same time, avoids cancellations between neighboring filter responses of opposite sign in the subsampling layers. Adding a local contrast normalization step after each feature extraction layer [4] further improves the accuracy to 60%.

The second interesting result is that pre-training the stages one after the other with a new unsupervised method, and then fine-tuning the resulting network with supervised gradient descent, bumps the accuracy up to 67.2%. The procedure is reminiscent of several recent proposals for "deep learning" [2, 3]. Our layer-wise unsupervised training method is called Predictive Sparse Decomposition (PSD). It consists in learning an overcomplete set of basis functions from which the input can be reconstructed with a sparse code, together with a feed-forward function trained to predict that code directly; a sketch of the objective is given below.
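As a concrete rendering of the two-stage architecture described above, here is a minimal sketch in PyTorch (an assumption on our part; the text names no implementation). The divisive contrast normalization is a simplified stand-in for the scheme cited in [4], the random 16-map connection table of the second stage is approximated by a dense convolution for brevity, and the 9x9 second-stage filter size is not stated in the text but is inferred from the quoted 4x4 output maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalContrastNorm(nn.Module):
    """Divisive local contrast normalization: subtract the local mean, then
    divide each pixel by the standard deviation of its neighborhood (a
    simplified version of the normalization described in the text)."""
    def __init__(self, kernel_size=9, eps=1e-4):
        super().__init__()
        self.kernel_size = kernel_size
        self.eps = eps

    def forward(self, x):
        k, pad, c = self.kernel_size, self.kernel_size // 2, x.shape[1]
        # Per-channel box filter used as the local averaging window.
        w = torch.ones(c, 1, k, k, device=x.device, dtype=x.dtype) / (k * k)
        mean = F.conv2d(x, w, padding=pad, groups=c)
        centered = x - mean
        var = F.conv2d(centered ** 2, w, padding=pad, groups=c)
        return centered / (var.sqrt() + self.eps)

class TwoStageConvNet(nn.Module):
    """Two-stage architecture from the text: 143x143 grayscale input, tanh
    followed by absolute-value rectification, average-pooling subsampling,
    and contrast normalization after each feature-extraction stage."""
    def __init__(self, num_classes=101):
        super().__init__()
        self.norm0 = LocalContrastNorm()                     # preprocessing
        self.conv1 = nn.Conv2d(1, 64, kernel_size=9)         # 143 -> 135
        self.pool1 = nn.AvgPool2d(kernel_size=10, stride=5)  # 135 -> 26
        self.norm1 = LocalContrastNorm()
        # Dense stand-in for the random 16-map connection table; the 9x9
        # size here is inferred, not stated in the text.
        self.conv2 = nn.Conv2d(64, 256, kernel_size=9)       # 26 -> 18
        self.pool2 = nn.AvgPool2d(kernel_size=6, stride=4)   # 18 -> 4
        self.norm2 = LocalContrastNorm()
        self.classifier = nn.Linear(256 * 4 * 4, num_classes)

    def forward(self, x):
        x = self.norm0(x)
        x = torch.abs(torch.tanh(self.conv1(x)))  # rectifying non-linearity
        x = self.norm1(self.pool1(x))
        x = torch.abs(torch.tanh(self.conv2(x)))
        x = self.norm2(self.pool2(x))
        return self.classifier(x.flatten(1))

net = TwoStageConvNet()
logits = net(torch.randn(1, 1, 143, 143))  # -> shape (1, 101)
```

With these settings the spatial sizes work out as stated: 143 -> 135 after the first 9x9 convolution, 26 after the 10x10 window with stride 5, 18 after the second convolution, and 4 after the 6x6 window with stride 4, giving the 256 x 4 x 4 = 4096-dimensional input to the classifier.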
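The PSD objective itself can be pictured with the following sketch, which combines a reconstruction term over the overcomplete basis, an L1 sparsity penalty, and a term tying the sparse code to a trainable feed-forward predictor, in line with [2]. The ISTA inference loop, the tanh form of the predictor, the variable names, and all hyper-parameter values are illustrative assumptions, not details from the text.

```python
import torch

def soft_threshold(v, lam):
    # Proximal operator of the L1 penalty lam * |z|_1.
    return v.sign() * (v.abs() - lam).clamp(min=0.0)

def psd_step(x, B, W, b, lam=0.5, alpha=1.0, n_ista=20, lr=0.01):
    """One training step of a PSD-style objective on a batch of flattened
    patches x of shape (n, d):

        ||x - z B^T||^2 + lam * |z|_1 + alpha * ||z - tanh(x W^T + b)||^2

    B (d, k) is the overcomplete dictionary; W (k, d) and b (k,) form the
    feed-forward predictor of the sparse code z."""
    # Inference: find the sparse code z by ISTA, with B, W, b held fixed.
    with torch.no_grad():
        pred = torch.tanh(x @ W.T + b)
        z = pred.clone()
        # Step size from a Lipschitz bound on the smooth part of the objective.
        L = 2.0 * (torch.linalg.matrix_norm(B, ord=2) ** 2 + alpha)
        for _ in range(n_ista):
            grad = 2.0 * ((z @ B.T - x) @ B + alpha * (z - pred))
            z = soft_threshold(z - grad / L, lam / L)
    # Learning: gradient step on dictionary and predictor with z held fixed
    # (the L1 term is constant with respect to B, W, and b).
    loss = ((x - z @ B.T) ** 2).sum() \
         + alpha * ((z - torch.tanh(x @ W.T + b)) ** 2).sum()
    gB, gW, gb = torch.autograd.grad(loss, (B, W, b))
    with torch.no_grad():
        B -= lr * gB
        B /= B.norm(dim=0, keepdim=True).clamp(min=1e-8)  # keep basis unit-norm
        W -= lr * gW
        b -= lr * gb
    return loss.item()

# Usage sketch: 12x12 patches (d = 144) with a 2x overcomplete code (k = 288).
d, k = 144, 288
B = torch.randn(d, k).requires_grad_()
W = (0.01 * torch.randn(k, d)).requires_grad_()
b = torch.zeros(k, requires_grad=True)
print(psd_step(torch.randn(32, d), B, W, b))
```

After training, the predictor tanh(x W^T + b) produces approximate sparse codes in a single feed-forward pass, which is what makes a PSD-trained layer usable as the initialization of a ConvNet stage.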

[1] Geoffrey E. Hinton et al. Reducing the Dimensionality of Data with Neural Networks. Science, 2006.

[2] Marc'Aurelio Ranzato et al. Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition. arXiv, 2010.

[3] Nicolas Pinto et al. Why is Real-World Visual Object Recognition Hard? PLoS Computational Biology, 2008.

[4] Cordelia Schmid et al. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories. CVPR, 2006.

[5] Jitendra Malik et al. SVM-KNN: Discriminative Nearest Neighbor Classification for Visual Category Recognition. CVPR, 2006.

[6] Marc'Aurelio Ranzato et al. Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition. CVPR, 2007.

[7] Yoshua Bengio et al. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 1998.