Hopfield Networks is All You Need

We show that the transformer attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially many patterns (in the dimension), converges in one update, and has exponentially small retrieval errors. The number of stored patterns is traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update): (1) a global fixed point that averages over all patterns, (2) metastable states that average over a subset of patterns, and (3) fixed points that store a single pattern. Transformer and BERT models tend to operate in the global averaging regime in their first layers and in metastable states in higher layers. The gradient in transformers is maximal for metastable states, uniformly distributed for global averaging, and vanishes for a fixed point near a stored pattern. Using the Hopfield network interpretation, we analyzed the learning of transformer and BERT models. Learning starts with attention heads that average; most of them then switch to metastable states. However, the majority of heads in the first layers still average and can be replaced by averaging operations, e.g. our proposed Gaussian weighting. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads appear to be a promising target for improving transformers. Neural networks with Hopfield networks outperform other methods on immune repertoire classification, where the Hopfield network stores several hundred thousand patterns. We provide a new PyTorch layer called "Hopfield", which makes it possible to equip deep learning architectures with modern Hopfield networks as a new powerful concept comprising pooling, memory, and attention. GitHub: https://github.com/ml-jku/hopfield-layers
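The core of this equivalence is the continuous Hopfield update ξ_new = X softmax(β X^T ξ), where the columns of X are the stored patterns and ξ is the state (query) pattern. The sketch below is not the released "Hopfield" layer; it is a minimal plain-PyTorch illustration under assumed toy settings (dimensions d = 64, N = 10, and the illustrative helper name hopfield_update), showing how the inverse temperature β moves retrieval between the single-pattern and global-averaging regimes described above.

```python
# Minimal sketch (assumption: not the authors' released layer) of the continuous
# modern Hopfield update  xi_new = X softmax(beta * X^T xi), which has the same
# form as transformer attention with X providing keys/values and xi the query.
import torch

def hopfield_update(X: torch.Tensor, xi: torch.Tensor, beta: float) -> torch.Tensor:
    """One retrieval step. X: (d, N) matrix of N stored patterns, xi: (d,) state pattern."""
    p = torch.softmax(beta * (X.T @ xi), dim=0)  # softmax weights over the N stored patterns
    return X @ p                                 # convex combination of stored patterns

# Toy demonstration of two regimes via the inverse temperature beta:
# large beta -> fixed point near a single stored pattern (sharp retrieval),
# tiny beta  -> global fixed point averaging over all patterns.
# (Metastable states, i.e. averages over subsets of similar patterns, arise at
# intermediate beta and are not shown here.)
torch.manual_seed(0)
d, N = 64, 10
X = torch.nn.functional.normalize(torch.randn(d, N), dim=0)  # unit-norm stored patterns
xi = X[:, 0] + 0.1 * torch.randn(d)                          # noisy query near pattern 0

retrieved = hopfield_update(X, xi, beta=32.0)
print(torch.cosine_similarity(retrieved, X[:, 0], dim=0))       # close to 1: pattern 0 retrieved

averaged = hopfield_update(X, xi, beta=0.01)
print(torch.cosine_similarity(averaged, X.mean(dim=1), dim=0))  # close to 1: global average
```

With learnable projections mapping ξ to a query and X to keys and values, one such update step takes exactly the form of transformer attention, with β corresponding to the 1/√d_k scaling.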
