Large Automatic Learning, Rule Extraction, and Generalization

Since antiquity, man has dreamed of building a device that would "learn from examples", "form generalizations", and "discover the rules" behind patterns in the data. Recent work has shown that a highly connected, layered network of simple analog processing elements can be astonishingly successful at this, in some cases. In order to be precise about what has been observed, we give definitions of memorization, generalization, and rule extraction. The most important part of this paper proposes a way to measure the entropy or information content of a learning task and the efficiency with which a network extracts information from the data. We also argue that the way in which the networks can compactly represent a wide class of Boolean (and other) functions is analogous to the way in which polynomials or other families of functions can be "curve fit" to general data; specifically, they extend the domain and average noisy data. Alas, finding a suitable representation is generally an ill-posed and ill-conditioned problem. Even when the problem has been "regularized", what remains is a difficult combinatorial optimization problem. When a network is given more resources than the minimum needed to solve a given task, the symmetric, low-order, local solutions that humans seem to prefer are not the ones that the network chooses from the vast number of solutions available; indeed, the generalized delta method and similar learning procedures do not usually hold the "human" solutions stable against perturbations. Fortunately, there are ways of "programming" into the network a preference for appropriately chosen symmetries.

1. Overview of the contents

Section 2 gives several examples that illustrate the importance of automatic learning from examples. Section 3 poses a test-case problem ("clumps") which will be used throughout the paper to illustrate the issues of interest. Section 4 describes the class of networks we are considering and introduces the notation. Section 5 presents a proof by construction that a two-layer network can represent any Boolean function, and section 6 shows that there is an elegant representation for the clumps task, using very few weights and processing units. Sections 7 and 8 argue that the objective function E(W) has a complicated structure: good solutions are generally not points in W space, but rather parameterized families of points. Furthermore, in all but the simplest situations, the E surface is riddled with local minima, and any automatic learning procedure must take firm measures to deal with this. Section 9 shows that our clumps task is a very simple problem, according to the various schemes that have been proposed to quantify the complexity of network tasks and solutions. Section 10 shows that a general network does not prefer the simple solutions that humans seem to prefer. Section 11 discusses the crucial effect of changes of representation on the feasibility of automatic learning. We prove that "automatic learning will always succeed, given the right preprocessor," but we also show that this statement is grossly misleading since there is no automatic procedure for constructing the required preprocessor.
Sections 12 and 13 propose definitions of rule extraction and generalization and emphasize the distinction between the two. Section 14 calculates the entropy budget for rule extraction and estimates the information available from the training data and from the "programming" or "architecture" of the network. This leads to an approximate expression for the efficiency with which the learning procedure extracts information from the training data. Section 16 presents a simple model which allows us to calculate the error rate during the learning process. Section 17 discusses the relationship between rule extraction in general and associative memory in particular. In section 18, we argue that when special information is available, such as information about the symmetry, geometry, or topology of the task at hand, the network must be provided with this information. We also discuss various ways in which this information can be "programmed" into the network. Section 19 draws the analogy between the family of functions that can be implemented by networks with limited amounts of resources and other families of functions such as polynomials of limited degree. Appendix A contains details of the conditions under which our data was taken.

2. Why learn from examples?

Automatic learning from examples is a topic of enormous importance. There are many applications where there is no other way to approach the task. For example, consider the problem of recognizing hand-written characters. The raw image can be fed to a preprocessor that will detect salient features such as straight line segments, arcs, terminations, etc., in various parts of the field. But what then? There is no mathematical expression that will tell you what features correspond to a "7" or a "Q". The task is defined purely by the statistics of what features conventionally go with what meaning; there is no other definition. There is no way to program it; the solution must be learned from examples [6,11]. Another example is the task of producing the correct pronunciation of a segment of written English. There are patterns and rules of pronunciation, but they are so complex that a network that could "discover the rules" on its own would save an enormous amount of labor [37]. Another example concerns clinical medicine: the task of mapping a set of symptoms onto a diagnosis. Here the inputs have physical meaning; they are not purely conventional as in the previous examples, but we are still a long way from writing down an equation or a computer program that will perform the task a priori. We must learn from the statistics of past examples [41]. Other examples include classifying sonar returns [10], recognizing speech [5,16,30,23], and predicting the secondary structure of proteins from the primary sequence [42].

In the foregoing examples, there was really no alternative to learning from examples. However, in order to learn more about the power and limitations of various learning methods and to evaluate new methods as they are proposed, people have studied a number of "test cases" where there was an alternative, that is, where the "correct" solution was well understood. These include classifying input patterns according to their parity [33], geometric shape [33,35], or spatial symmetry [36].
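As an illustration of such a test case, here is a minimal sketch (in Python, with function names chosen by us for illustration, not taken from the cited work) that enumerates the parity task: each N-bit input pattern is labeled T exactly when it contains an odd number of true bits.

```python
from itertools import product

def parity(bits):
    # T (True) when the number of true bits is odd, F (False) otherwise.
    return sum(bits) % 2 == 1

def parity_training_set(n):
    # Enumerate all 2**n input patterns together with their target outputs.
    return [(bits, parity(bits)) for bits in product((0, 1), repeat=n)]

if __name__ == "__main__":
    for pattern, target in parity_training_set(3):
        print(pattern, "T" if target else "F")
```

Because the "correct" answer is known for every pattern, such tasks make it possible to judge how well a learning procedure has done, which is not possible for tasks defined only by the statistics of examples.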
3. Example: two-or-more clumps

The test case that we will use throughout this paper is a simple geometric task which an adaptive network ought to be able to handle. The network's input patterns will be N-bit binary strings. Sometimes we will treat the patterns as numbers, so we can speak of numerical order; sometimes we will also treat them as one-dimensional images, in which false bits (Fs) represent white pixels and true bits (Ts) represent black pixels. A contiguous clump of Ts represents a solid black bar. We then choose the following rule to determine the desired output of the network, as shown in table 1: if the input pattern is such that all the Ts appear in one contiguous clump, then the output should be F, and if there are two or more clumps, then the output should be T. We call this the two-or-more clumps predicate. We will consider numerous variations of this problem, such as three-versus-two clumps and so forth. The one-versus-two clumps version is also known as the contiguity predicate [25]. Questions of connectedness have played an important role in the history of networks and automatic learning: Minsky and Papert devoted a sizable portion of their book [27] to this sort of question.

Input pattern   Output   Interpretation
ffftttffff      F        1 clump
fffttftfff      T        2 clumps
fttttttttt      F        1 clump
tttffttfft      T        3 clumps
ffffffffff      F        no clumps

Table 1: Examples of the two-or-more clumps predicate.

There are a host of important questions that immediately arise, some of which are listed below. In some cases, we give summary answers; the details of the answers will be given in following sections.

Can any network of the type we are considering actually represent such a function? (Yes.) This is not a trivial result, since Minsky and Papert [27] showed that a Perceptron (with one layer of adjustable weights) absolutely could not perform a wide class of functions, and our function is in this class.

Can it perform the function efficiently? (Yes.) This is in contrast, say, to a solution of the parity function using a standard programmable logic array (PLA) [26], which is possible but requires enormous numbers of hardware components (O(2^N) gates).

Can the network learn to perform this function, by learning from examples? (Yes.) How quickly can it learn it? (It depends; see below.)

How many layers are required, and how many hidden units in each layer? How do the answers to the previous questions depend on the architecture (i.e., size and shape) of the network?

How sensitive are the results to the numerical methods and other details of the implementation, such as the analog representation of T and F, "momentum terms", "weight decay terms", step size, etc.?

Does the solution (i.e., the configuration of weights) that the network finds make sense? Is it similar to the solutions that humans would choose, given the task of designing such a network by hand?
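To make the predicate concrete, here is a minimal sketch in Python (the function names count_clumps and two_or_more_clumps are our own illustrative choices, not notation from this paper) that counts contiguous clumps of Ts and reproduces the desired outputs of table 1.

```python
def count_clumps(pattern):
    # Count maximal runs of contiguous 't' bits in a pattern such as "fffttftfff".
    clumps = 0
    previous = 'f'
    for bit in pattern:
        if bit == 't' and previous == 'f':
            clumps += 1  # a new clump starts here
        previous = bit
    return clumps

def two_or_more_clumps(pattern):
    # Desired network output: T (True) iff the pattern contains two or more clumps.
    return count_clumps(pattern) >= 2

if __name__ == "__main__":
    # The example patterns of table 1.
    for pattern in ["ffftttffff", "fffttftfff", "fttttttttt",
                    "tttffttfft", "ffffffffff"]:
        output = "T" if two_or_more_clumps(pattern) else "F"
        print(pattern, output, count_clumps(pattern), "clump(s)")
```

Running this sketch on the five patterns of table 1 yields F, T, F, T, F, matching the desired outputs; the point of the paper, of course, is whether a network can learn this mapping from examples rather than being handed the rule.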