A Review of Classification

The summarization of large quantities of multivariate data by clusters, undefined a priori, is increasingly practiced, often irrelevantly and unjustifiably. This paper attempts to survey the burgeoning bibliography, restricting itself to published, freely available references of known provenance. A plethora of definitions of similarity and of cluster are presented. The principles, but not details of implementation, of the many empirical classification techniques currently in use are discussed, and limitations and shortcomings in their development and practice are pointed out. Methods based on well-defined mathematical formulations of the problem are emphasized, and other ways of summarizing data are suggested as alternatives to classification. The growing tendency to regard numerical taxonomy as a satisfactory alternative to clear thinking is condemned.
