Automated identification of stratifying signatures in cellular subpopulations

Significance: Single-cell measurement technologies such as flow cytometry permit the investigation of specific cellular subpopulations. Mass cytometry currently measures >40 parameters per cell and produces phenotypically rich datasets that may be retrospectively interrogated for relevant biological signal. Few methods identify experimentally relevant subpopulations within these datasets, and most do not scale well to higher-dimensional measurements. To address this bottleneck, we present a data-driven method termed Citrus that identifies cell subsets associated with an experimental endpoint of interest. Citrus can test diverse experimental hypotheses and is demonstrated through the systematic identification of (i) blood cells that signal in response to experimental stimuli and (ii) T-cell subsets whose abundance is predictive of AIDS-free survival risk in patients with HIV.

Elucidation and examination of cellular subpopulations that display condition-specific behavior can contribute critically to understanding disease mechanism and can provide a focal point for the development of diagnostic criteria linking such a mechanism to clinical prognosis. Despite recent advances in single-cell measurement technologies, the identification of relevant cell subsets through manual effort remains standard practice. As new technologies such as mass cytometry increase the parameterization of single-cell measurements, the limited scalability and the subjectivity inherent in manual analysis slow both analysis and progress. We therefore developed Citrus (cluster identification, characterization, and regression), a data-driven approach for the identification of stratifying subpopulations in multidimensional cytometry datasets. The methodology of Citrus is demonstrated through the identification of known and unexpected pathway responses in a dataset of stimulated peripheral blood mononuclear cells measured by mass cytometry. Additionally, the performance of Citrus is compared with that of existing methods through the analysis of several publicly available datasets. As the complexity of flow cytometry datasets continues to increase, methods such as Citrus will be needed to help investigators perform unbiased, and potentially more thorough, correlation-based mining and inspection of cell subsets nested within high-dimensional datasets.
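The three stages named in the acronym (cluster identification, characterization, and regression) can be illustrated with a minimal sketch. It is not the Citrus implementation: the abstract does not specify the algorithms, so every choice here is an assumption made for illustration. Clustering is done with a simple k-means stand-in, clusters are characterized by their relative abundance in each sample, and the regression step is an L1-penalized logistic regression fitted by proximal gradient descent. The toy cells, the sample layout, the function names, and the penalty `lam=0.4` are all hypothetical; in practice a penalty like this would typically be chosen by cross-validation.

```python
import math

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_cells(cells, k, iters=10):
    """Stand-in clustering (assumption): k-means with farthest-point initialization."""
    centers = [cells[0]]
    while len(centers) < k:
        centers.append(max(cells, key=lambda c: min(sqdist(c, m) for m in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(c, centers[j])) for c in cells]
        for j in range(k):
            members = [c for c, lab in zip(cells, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def cluster_abundances(labels, sample_ids, n_samples, k):
    """Characterization: fraction of each sample's cells that falls in each cluster."""
    counts = [[0] * k for _ in range(n_samples)]
    for lab, s in zip(labels, sample_ids):
        counts[s][lab] += 1
    return [[c / sum(row) for c in row] for row in counts]

def standardize(X):
    """Z-score each feature column so the L1 penalty treats clusters equally."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]

def soft_threshold(v, t):
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def l1_logistic(X, y, lam, lr=0.5, steps=500):
    """Regression: L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(steps):
        resid = [1.0 / (1.0 + math.exp(-(b + sum(wj * xj for wj, xj in zip(w, row))))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b -= lr * sum(resid) / n  # the intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w, b

# Hypothetical toy data: three populations in a two-marker space, six samples.
# Each entry is (cell counts for populations A, B, C; endpoint label).
CENTERS = {"A": (0.0, 0.0), "B": (5.0, 0.0), "C": (0.0, 5.0)}
SAMPLES = [((16, 2, 2), 0), ((11, 3, 6), 0), ((8, 2, 10), 0),
           ((10, 8, 2), 1), ((5, 9, 6), 1), ((2, 8, 10), 1)]

cells, sample_ids, truth = [], [], []
for s, (counts, _) in enumerate(SAMPLES):
    for name, n_cells in zip("ABC", counts):
        cx, cy = CENTERS[name]
        for i in range(n_cells):
            # Small deterministic jitter around the population center.
            cells.append((cx + 0.1 * (i % 3), cy + 0.1 * (i % 2)))
            sample_ids.append(s)
            truth.append(name)

labels = cluster_cells(cells, k=3)
features = cluster_abundances(labels, sample_ids, n_samples=len(SAMPLES), k=3)
endpoint = [y for _, y in SAMPLES]
weights, intercept = l1_logistic(standardize(features), endpoint, lam=0.4)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("stratifying clusters:", selected, "weights:", [round(w, 3) for w in weights])
```

In this toy layout, population B is the one whose abundance differs most consistently between the two endpoint groups, so a sufficiently strong L1 penalty should leave only that cluster with a nonzero weight. The sparsity is the point of the design: clusters whose weights are shrunk to zero are treated as not stratifying, which leaves a short list of subpopulations to inspect.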
