A numerical measure of the instability of Mapper-type algorithms

Mapper is an unsupervised machine learning algorithm that generalises the notion of clustering to obtain a geometric description of a dataset. The procedure splits the data into possibly overlapping bins, which are then clustered. The output of the algorithm is a graph whose nodes represent clusters and whose edges represent the sharing of data points between two clusters. However, several parameters must be selected before applying Mapper, and the resulting graph may vary dramatically with the choice of parameters. We define an intrinsic notion of Mapper instability that measures the variability of the output as a function of the parameters required to construct it. Our results and discussion are general and apply to all Mapper-type algorithms. We derive theoretical results that provide estimates for the instability and suggest practical ways to control it. We also provide experiments to illustrate our results; in particular, we demonstrate that a reliable candidate Mapper output can be identified as a local minimum of instability regarded as a function of the Mapper input parameters.
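The Mapper construction described above (overlapping bins, clustering within each bin, a graph whose edges record shared points) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `overlapping_intervals`, `single_linkage` and `mapper`, the one-dimensional filter, the equal-length interval cover and the distance-threshold clustering with parameter `eps` are all choices made here for illustration only.

```python
from itertools import combinations
from math import dist


def overlapping_intervals(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals; `overlap` is the
    fraction (0 <= overlap < 1) shared by consecutive intervals."""
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(indices, points, eps):
    """Cluster `indices` as connected components of the graph joining
    points at distance <= eps (an assumed, simple clustering rule)."""
    clusters, unvisited = [], set(indices)
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_values, n_intervals, overlap, eps, tol=1e-9):
    """Return (nodes, edges): nodes are clusters (sets of point indices),
    edges join two clusters that share at least one data point."""
    lo, hi = min(filter_values), max(filter_values)
    nodes = []
    for a, b in overlapping_intervals(lo, hi, n_intervals, overlap):
        # Bin: all points whose filter value falls in the interval
        # (tol guards against floating-point error at the endpoints).
        members = [i for i, f in enumerate(filter_values) if a - tol <= f <= b + tol]
        nodes.extend(single_linkage(members, points, eps))
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges


# Five points on a line, filtered by their coordinate.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
filter_values = [p[0] for p in points]
coarse = mapper(points, filter_values, n_intervals=2, overlap=0.5, eps=1.0)
fine = mapper(points, filter_values, n_intervals=2, overlap=0.5, eps=0.5)
```

On this toy input, changing only the clustering threshold `eps` changes the graph from two nodes to six, which is the kind of parameter sensitivity the abstract refers to. The sketch does not compute the instability measure itself, whose definition is not given in this section.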
