Probabilistic Harmonization and Annotation of Single-cell Transcriptomics Data with Deep Generative Models

As single-cell transcriptomics becomes a mainstream technology, the natural next step is to integrate the accumulating data in order to achieve a common ontology of cell types and states. However, owing to various nuisance factors of variation, it is not straightforward to compare gene expression levels across data sets or to automatically assign cell type labels in a new data set based on existing annotations. In this manuscript, we demonstrate that our previously developed method, scVI, provides an effective and fully probabilistic approach for joint representation and analysis of cohorts of single-cell RNA-seq data sets, while accounting for uncertainty caused by biological and measurement noise. We also introduce single-cell ANnotation using Variational Inference (scANVI), a semi-supervised variant of scVI designed to leverage any available cell state annotations, for instance when only one data set in a cohort is annotated, or when only a few cells in a single data set can be labeled using marker genes. We demonstrate that scVI and scANVI compare favorably to existing methods for data integration and cell state annotation in terms of accuracy, scalability, and adaptability to challenging settings such as a hierarchical structure of cell state labels. We further show that, unlike existing methods, scVI and scANVI represent the integrated data sets with a single generative model that can be directly used for any probabilistic decision-making task, using differential expression as our case study. scVI and scANVI are available as open-source software and can be readily used to facilitate cell state annotation and help ensure consistency and reproducibility across studies.
