
Learning hierarchical category structure in deep neural networks

Andrew M. Saxe (asaxe@stanford.edu), Department of Electrical Engineering
James L. McClelland (mcclelland@stanford.edu), Department of Psychology
Surya Ganguli (sganguli@stanford.edu), Department of Applied Physics
Stanford University, Stanford, CA 94305 USA

Abstract

Psychological experiments have revealed remarkable regularities in the developmental time course of cognition. Infants generally acquire broad categorical distinctions (e.g., plant/animal) before finer ones (e.g., bird/fish), and periods of little change are often punctuated by stage-like transitions. This pattern of progressive differentiation has also been seen in neural network models as they learn from exposure to training data. Our work explains why the networks exhibit these phenomena. We find solutions to the dynamics of error-correcting learning in linear three layer neural networks. These solutions link the statistics of the training set to the dynamics of learning in the network, and characterize formally how learning leads to the emergence of structured representations for arbitrary training environments. We then consider training a neural network on data generated by a hierarchically structured probabilistic generative process. Our results reveal that, for a broad class of such structures, the learning dynamics must exhibit progressive, coarse-to-fine differentiation, with stage-like transitions punctuating longer dormant periods.

Keywords: neural networks; hierarchical generative models; semantic cognition; learning dynamics

Introduction

Our world is characterized by a rich, nested hierarchical structure of categories within categories, and one of the most remarkable aspects of human semantic development is our ability to learn and exploit this structure. Experimental work has shown that infants and children acquire broad categorical distinctions before fine ones (Keil, 1979; Mandler & McDonough, 1993), suggesting that human category learning is marked by a progressive differentiation of concepts from broad to fine. Furthermore, humans can exhibit stage-like transitions as they learn, rapidly progressing through successive levels of mastery (Inhelder & Piaget, 1958; Siegler, 1976).

Many neural network simulations have captured aspects of these broad patterns of semantic development (Rogers & McClelland, 2004; Rumelhart & Todd, 1993; McClelland, 1995; Plunkett & Sinha, 1992; Quinn & Johnson, 1997). The internal representations of such networks exhibit both progressive differentiation and stage-like transitions. However, the theoretical basis for the ability of neural networks to exhibit such strikingly rich nonlinear behavior remains elusive. What are the essential principles that underlie such behavior? What aspects of statistical structure in the input are responsible for driving such dynamics? For example, must networks exploit nonlinearities in their input-output map to detect the higher order statistical regularities that drive such learning?

Figure 1: The three layer network analyzed in this work, with an input layer x ∈ R^N1, a hidden layer h ∈ R^N2, and an output layer y ∈ R^N3, connected by weight matrices W^21 (input to hidden) and W^32 (hidden to output).

Here we analyze the learning dynamics of a linear three layer network and find, surprisingly, that it can exhibit highly nonlinear learning dynamics, including rapid stage-like transitions.
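To make this setting concrete, the sketch below implements the linear map y = W^32 W^21 x of Figure 1 together with a single step of error-correcting gradient descent on the squared error; the layer sizes, learning rate, and random data are illustrative assumptions of this sketch, not values taken from the paper.

```python
# Minimal sketch of the three layer linear network of Figure 1 (NumPy).
import numpy as np

rng = np.random.default_rng(0)

N1, N2, N3 = 4, 3, 5                            # illustrative layer sizes
W21 = rng.normal(scale=0.01, size=(N2, N1))     # input-to-hidden weights
W32 = rng.normal(scale=0.01, size=(N3, N2))     # hidden-to-output weights

def forward(x):
    """Linear map: hidden activity h = W21 x, output y = W32 W21 x."""
    h = W21 @ x
    return h, W32 @ h

def gradient_step(x, t, lr=0.1):
    """One step of error-correcting learning on the squared error ||t - y||^2."""
    global W21, W32
    h, y = forward(x)
    err = t - y                                 # output error
    dW32 = np.outer(err, h)                     # gradient for hidden-to-output weights
    dW21 = np.outer(W32.T @ err, x)             # error backpropagated to input-to-hidden weights
    W32 += lr * dW32
    W21 += lr * dW21
    return err

x, t = rng.normal(size=N1), rng.normal(size=N3)
print(np.linalg.norm(gradient_step(x, t)))      # norm of the output error for this example
```

Even though the input-output map is linear, iterating such updates over a structured training set can produce the nonlinear, stage-like learning trajectories analyzed below.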
Furthermore, when exposed to hierarchically structured data sampled from a hierarchical probabilistic model, the network exhibits progressive differentiation of concepts from broad to fine. Since such linear networks are sensitive only to the second order statistics of inputs and outputs, this yields the intriguing result that merely second order patterns of covariation in hierarchically structured data contain statistical signals powerful enough to drive certain nontrivial, high level aspects of semantic development in deep networks.

We outline our approach here in brief. We begin by decomposing the training set to identify important dimensions of variation using the singular value decomposition (SVD), which will turn out to be fundamental to our analysis. Next, we examine the equations governing gradient descent learning and show that they can be solved in terms of the SVD of the training set. This solution analytically expresses the weight values of the neural network at any point in time during learning as a function of the input training set. Finally, we consider generating the training set from a hierarchical probabilistic generative model. We analytically calculate the SVD of training sets so generated, which in combination with our previous results gives a formal grounding for how neural networks will learn about hierarchical categorical structure. We show that networks must exhibit progressive differentiation of categorical structure and stage-like transitions for any training set generated by a class of hierarchical generative models.

Decomposing the training set

Our fundamental goal is to understand the dynamics of learning in neural networks as a function of the training set. Toward this goal, in this section we introduce the singular
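As a concrete preview of this decomposition step, the sketch below builds a small, hierarchically structured toy training set and decomposes it with the SVD. Treating the input-output correlation matrix of the training pairs as the object being decomposed, along with the particular items, properties, and one-hot input coding, are illustrative assumptions of this sketch rather than details fixed by the text above.

```python
# Minimal sketch of decomposing a toy hierarchical training set with the SVD.
import numpy as np

items = ["canary", "robin", "salmon", "cod"]   # four items, one-hot input coding
X = np.eye(len(items))

# Properties (rows) for each item (columns): a shared property, a coarse
# bird/fish split, and item-specific features, encoded as binary attributes.
Y = np.array([
    [1, 1, 1, 1],   # "can move"  - shared by every item
    [1, 1, 0, 0],   # "has wings" - birds only
    [0, 0, 1, 1],   # "has gills" - fish only
    [1, 0, 0, 0],   # "sings"     - canary only
    [0, 0, 1, 0],   # "is pink"   - salmon only
], dtype=float)

# Input-output correlation matrix of the training set: sum over pairs of y x^T.
Sigma_yx = Y @ X.T

# The SVD orders the dimensions of variation by their singular values.
U, S, Vt = np.linalg.svd(Sigma_yx, full_matrices=False)
for a, s in enumerate(S):
    print(f"mode {a}: singular value {s:.2f}, item loadings {np.round(Vt[a], 2)}")
```

In this toy set the leading modes carry the property shared by all items and the broad bird/fish distinction, while the item-specific features appear only in the weaker modes, mirroring the coarse-to-fine ordering exploited by the analysis outlined above.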