Improving Anytime Prediction with Parallel Cascaded Networks and a Temporal-Difference Loss

Although deep feedforward neural networks share some characteristics with the primate visual system, a key distinction is their dynamics. Deep nets typically operate in serial stages wherein each layer completes its computation before processing begins in subsequent layers. In contrast, biological systems have cascaded dynamics: information propagates from neurons at all layers in parallel, but transmission occurs gradually over time, leading to speed-accuracy trade-offs even in feedforward architectures. We explore the consequences of biologically inspired parallel hardware by constructing cascaded ResNets in which each residual block has propagation delays but all blocks update in parallel in a stateful manner. Because information transmitted through skip connections avoids delays, the functional depth of the architecture increases over time, yielding anytime predictions that improve with internal-processing time. We introduce a temporal-difference training loss that achieves a strictly superior speed-accuracy profile over standard losses and enables the cascaded architecture to outperform state-of-the-art anytime-prediction methods. The cascaded architecture has intriguing properties, including: it classifies typical instances more rapidly than atypical instances; it is more robust to both persistent and transient noise than is a conventional ResNet; and its time-varying output trace provides a signal that can be exploited to improve information processing and inference.

Since the earliest investigations of artificial neural nets, their design has been informed by biological neural nets [37]. Perhaps the most compelling example is the convolutional net for machine vision, which has adopted properties of primate cortical neuroanatomy including a hierarchical layered organization, local receptive fields, and spatial equivariance [12]. In this article, we investigate the computational consequences of two fundamental properties of biological information processing systems that have not been considered in the design of deep neural nets. First, the brain consists of massively parallel, dedicated hardware with neurons throughout the cortex updating continuously and simultaneously. Second, information transmission between neurons introduces time delays [1]. As a result, unrefined and possibly incomplete neural state in one region is transmitted to the next region even as the state evolves; and feedforward connectivity yields a speed-accuracy trade-off in which the initial response to a static input occurs rapidly but can be inaccurate, with the output gradually improving over internal processing time. Following McClelland [36], we refer to such an architecture as cascaded.

Cascaded dynamics contrast sharply with the dynamics of standard feedforward nets, which operate in serial stages, each layer completing its computation before subsequent layers begin processing. Cascaded dynamics are also quite different from the dynamics of vision models with recurrent connections [e.g., 23, 25, 38, 47], which, given a static input, may iteratively update, but layer updates are still computed serially, with each layer completing its computation and then feeding it immediately to the next layer (or back to itself).
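To make the cascaded dynamics concrete, here is a minimal PyTorch sketch of the update scheme just described: the identity/skip path of each residual block propagates within a single time step, each residual branch operates on its input from the previous step, and all blocks are updated in parallel from stored states. This is an illustrative sketch rather than the authors' implementation; the class names, the zero initialization of stored states, and the constant channel width are assumptions.

```python
# Minimal sketch of cascaded residual dynamics (illustrative, not the paper's code).
# Assumptions: the identity/skip path is instantaneous within a time step, each
# residual branch F_i sees its input delayed by one step, stored states start at
# zero, and channels/spatial size are constant across blocks.
import torch
import torch.nn as nn


class CascadedBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(  # delayed residual branch F_i
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, skip_now, input_prev):
        # Skip path uses the current-step input; F uses the previous step's input.
        return torch.relu(skip_now + self.f(input_prev))


class CascadedStack(nn.Module):
    def __init__(self, channels, num_blocks):
        super().__init__()
        self.blocks = nn.ModuleList(CascadedBlock(channels) for _ in range(num_blocks))

    def forward(self, x, num_steps):
        # prev[i] holds last step's input to block i; the static image is available
        # from the start, deeper states begin at zero, so functional depth grows.
        prev = [x] + [torch.zeros_like(x) for _ in self.blocks]
        readouts = []
        for _ in range(num_steps):
            cur = [x]
            for i, block in enumerate(self.blocks):
                cur.append(block(cur[i], prev[i]))  # parallel update from stored states
            prev = cur
            readouts.append(cur[-1])  # anytime features, one per internal time step
        return readouts
```

In this sketch, the residual branches initially act on zeros, so the earliest readouts are dominated by the skip pathway and the effective depth grows by roughly one block per internal time step, consistent with the shallow-then-deep behavior described below.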
Fundamentally, our investigation asks: supposing we take a step toward biological realism with massively parallel hardware and relatively slow inter-neuron communication, what are the computational benefits and consequences?² We construct cascaded networks by introducing propagation delays in deep feedforward nets provided with a static input. We treat the net as massively parallel such that all units across all layers are updated simultaneously and iteratively. We focus on the ResNet architecture [14] and introduce a propagation delay into each residual block (Figure 1a). Because the skip connection permits faster transmission of more primitive perceptual representations, the functional depth of the resulting architecture increases over internal-processing time, yielding a trade-off between processing speed and complexity of processing. Consequently, the architecture offers a natural, integral mechanism for making predictions at any point in processing, known as anytime prediction [58]. Speed-accuracy trade-offs are a fundamental characteristic of human information processing [22, 42], and human perception has been modeled with deep learning anytime-prediction methods [29]. Although we focus on the ResNet, our approach can be incorporated into any model with skip connections (e.g., Highway Nets [48], DenseNet [19], U-Net [43], Transformers [52]).

The contrast between a serial, one-layer-at-a-time model and a cascaded, parallel-update model is illustrated in Figures 1b and 1c, respectively. To step through the operation of the cascaded model: at time 1, only the first residual block has received meaningful input, and the model prediction is therefore based only on this block's computation. At time 2, all higher residual blocks have received input from block 1, and the output is therefore based on all blocks' computations, though blocks 2 and above have deficient input. At each subsequent time, all blocks receive meaningful input, but it is not until time t that block t reaches its asymptotic output, because its input does not stabilize until time t − 1. In essence, the cascaded model behaves like a WideResNet [56] on the first steps and then becomes a deep ResNet.

Our work makes the following key contributions.

• We demonstrate the superiority of the cascaded architecture over the serial one (Figures 1b,c), indicating that parallelism can be exploited in a way that has not previously been studied.

• We propose and evaluate a novel training objective aimed at improving the predictions of anytime models. This temporal-difference (TD) loss [49] encourages the most accurate response as quickly as possible. TD training improves the performance of both cascaded and serial architectures. Although a rich literature exists aimed at reducing the number of computational steps required to obtain an accurate answer [2, 3, 4, 5, 11, 15, 16, 17, 18, 20, 24, 31, 38, 40, 45, 51, 54, 57], all of this work uses a degenerate form of TD for training, and our results suggest that these models can be improved using TD (a hedged sketch of one such loss appears below).

• The cascaded model trained with TD (CascadedTD) tends to respond most rapidly to prototypical exemplars, whereas training with the standard cross-entropy loss (CascadedCE) does not (Figure 2). We assess this with three quantitative prototypicality measures, and we further show that CascadedTD rapidly converges on the correct semantic family, whereas CascadedCE does not. These findings indicate that CascadedTD organizes knowledge differently across layers than does CascadedCE.
² As in much other research in deep learning [25, 8], biology informs our work by providing novel forms of inductive bias. Our goal is to investigate the computational consequences of these biases, not to model biological phenomena per se.
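For concreteness, the following is a hedged sketch of one way a TD(λ)-style objective over the sequence of anytime outputs could be written; it illustrates the idea of the TD loss discussed above rather than reproducing the paper's exact objective. The function name td_lambda_loss, the soft cross-entropy form, and the stop-gradient on bootstrapped targets are assumptions.

```python
# Hedged sketch of a TD(lambda)-style anytime loss (illustrative; the paper's exact
# objective may differ). Each step's prediction is trained toward a lambda-weighted
# mixture of the next step's detached prediction (bootstrap target) and the true
# label (terminal target). With lam = 1 every step is trained directly on the label,
# recovering the "degenerate" TD form mentioned above.
import torch
import torch.nn.functional as F


def td_lambda_loss(step_logits, labels, lam=0.9):
    """step_logits: list of (batch, num_classes) logits, one per internal time step."""
    num_classes = step_logits[0].shape[-1]
    target = F.one_hot(labels, num_classes).float()      # terminal target
    loss = F.cross_entropy(step_logits[-1], labels)      # final step: true label
    for t in range(len(step_logits) - 2, -1, -1):        # walk backward through time
        bootstrap = step_logits[t + 1].softmax(dim=-1).detach()
        target = (1.0 - lam) * bootstrap + lam * target  # recursive lambda-return
        step_loss = -(target * step_logits[t].log_softmax(dim=-1)).sum(dim=-1).mean()
        loss = loss + step_loss
    return loss / len(step_logits)
```

Here step_logits would be the per-step classifier outputs read out from the cascaded model at every internal time step, so the loss shapes the entire speed-accuracy trajectory rather than only the final prediction.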

[1] W. S. McCulloch et al. A logical calculus of the ideas immanent in nervous activity. The Philosophy of Artificial Intelligence, 1990.

[2] Denis G. Pelli et al. Anytime Prediction as a Model of Human Reaction Time. ArXiv, 2020.

[3] Le Song et al. Learning to Stop While Learning to Predict. ICML, 2020.

[4] Elisabetta Chicca et al. Efficient Processing of Spatio-Temporal Data Streams With Spiking Neural Networks. Frontiers in Neuroscience, 2020.

[5] E. Baccarelli et al. Why Should We Add Early Exits to Neural Networks? Cognitive Computation, 2020.

[6] Le Yang et al. Resolution Adaptive Networks for Efficient Inference. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[7] Maja Pantic et al. Toward fast and accurate human pose estimation via soft-gated skip connections. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020.

[8] Ziheng Jiang et al. Characterizing Structural Regularities of Labeled Data in Overparameterized Models. ICML, 2020.

[9] Michael Auli et al. Depth-Adaptive Transformer. ICLR, 2019.

[10] Thomas L. Griffiths et al. Human Uncertainty Makes Classification More Robust. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

[11] Nikolaus Kriegeskorte et al. Recurrent neural networks can explain flexible trading of speed and accuracy in biological vision. bioRxiv, 2019.

[12] Yoram Singer et al. Convolutional Bipartite Attractor Networks. ArXiv, 2019.

[13] Tudor Dumitras et al. Shallow-Deep Networks: Understanding and Mitigating Network Overthinking. ICML, 2018.

[14] Aran Nayebi et al. CORnet: Modeling the Neural Mechanisms of Core Object Recognition. bioRxiv, 2018.

[15] Jinwoo Shin et al. Anytime Neural Prediction via Slicing Networks Vertically. ArXiv, 2018.

[16] James J. DiCarlo et al. Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior. Nature Neuroscience, 2018.

[17] Andrew Zisserman et al. Massively Parallel Video Networks. ECCV, 2018.

[18] Jan Köhler et al. The streaming rollout of deep networks - towards fully model-parallel execution. NeurIPS, 2018.

[19] Jonathon S. Hare et al. Deep Cascade Learning. IEEE Transactions on Neural Networks and Learning Systems, 2018.

[20] Martial Hebert et al. Anytime Neural Network: a Versatile Trade-off Between Computation and Accuracy. 2018.

[21] Pavlo Molchanov et al. IamNN: Iterative and Adaptive Mobile Neural Network for Efficient Image Classification. ICLR, 2018.

[22] Jonathon Shlens et al. Recurrent Segmentation for Variable Computational Budgets. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017.

[23] Debadeepta Dey et al. Learning Anytime Predictions in Neural Networks via Adaptive Loss Balancing. AAAI, 2017.

[24] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.

[25] R. Srikant et al. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. ICLR, 2017.

[26] Xin Wang et al. IDK Cascades: Fast Deep Learning by Learning not to Overthink. UAI, 2017.

[27] Nikolaus Kriegeskorte et al. Recurrent Convolutional Neural Networks: A Better Model of Biological Object Recognition. bioRxiv, 2017.

[28] Kilian Q. Weinberger et al. Multi-Scale Dense Networks for Resource Efficient Image Classification. ICLR, 2017.

[29] Georgios Tzimiropoulos et al. How Far are We from Solving the 2D & 3D Face Alignment Problem? (and a Dataset of 230,000 3D Facial Landmarks). 2017 IEEE International Conference on Computer Vision (ICCV), 2017.

[30] Venkatesh Saligrama et al. Adaptive Neural Networks for Efficient Inference. ICML, 2017.

[31] Lin Sun et al. Feedback Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[32] H. T. Kung et al. BranchyNet: Fast inference via early exiting from deep neural networks. 2016 23rd International Conference on Pattern Recognition (ICPR), 2016.

[33] Kilian Q. Weinberger et al. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[34] Gregory Shakhnarovich et al. FractalNet: Ultra-Deep Neural Networks without Residuals. ICLR, 2016.

[35] Nikos Komodakis et al. Wide Residual Networks. BMVC, 2016.

[36] Alex Graves et al. Adaptive Computation Time for Recurrent Neural Networks. ArXiv, 2016.

[37] Jia Deng et al. Stacked Hourglass Networks for Human Pose Estimation. ECCV, 2016.

[38] Jian Sun et al. Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[39] Nikolaus Kriegeskorte et al. Deep neural networks: a new framework for modelling biological vision and brain information processing. bioRxiv, 2015.

[40] Yinda Zhang et al. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. ArXiv, 2015.

[41] Thomas Brox et al. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI, 2015.

[42] Jürgen Schmidhuber et al. Highway Networks. ArXiv, 2015.

[43] Michael S. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2014.

[44] James J. DiCarlo et al. How Does the Brain Solve Visual Object Recognition? Neuron, 2012.

[45] Fei-Fei Li et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[46] Matt Jones et al. Optimal Response Initiation: Why Recent Experience Matters. NIPS, 2008.

[47] Roger Ratcliff et al. The Diffusion Decision Model: Theory and Data for Two-Choice Decision Tasks. Neural Computation, 2008.

[48] M. Masson. Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology / Revue canadienne de psychologie experimentale, 2003.

[49] Shlomo Zilberstein et al. Using Anytime Algorithms in Intelligent Systems. AI Magazine, 1996.

[50] William Bialek et al. Reliability and information transmission in spiking neurons. Trends in Neurosciences, 1992.

[51] Richard S. Sutton et al. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[52] Kunihiko Fukushima et al. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 1980.

[53] James L. McClelland. On the time relations of mental processes: An examination of systems of processes in cascade. 1979.

[54] W. Marsden. I and J. 2012.

[55] Andrew Y. Ng et al. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011.

[56] Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images. 2009.

[57] Richard S. Sutton et al. Reinforcement Learning: An Introduction. IEEE Transactions on Neural Networks, 1998.

[58] D. J. Felleman et al. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1991.