Error-correcting DNA barcodes for high-throughput sequencing

Many large-scale high-throughput experiments use DNA barcodes (short DNA sequences prepended to DNA libraries) to identify individuals in pooled biomolecule populations. However, DNA synthesis and sequencing errors confound the correct interpretation of observed barcodes and can lead to significant data loss or spurious results. Widely used error-correcting codes borrowed from computer science (e.g., Hamming and Levenshtein codes) do not properly account for insertions and deletions in DNA barcodes, even though deletions are the most common type of synthesis error. Here, we present and experimentally validate FREE (Filled/truncated Right End Edit) barcodes, which correct substitution, insertion, and deletion errors, even when these errors alter the barcode length. FREE barcodes are designed with experimental considerations in mind, including balanced GC content, minimal homopolymer runs, and reduced internal hairpin propensity. We generate and include lists of barcodes with different lengths and error-correction levels that may be useful in diverse high-throughput applications, including >10^6 single-error-correcting 16-mers that strike a balance between decoding accuracy, barcode length, and library size. Moreover, concatenating two or more FREE codes into a single barcode increases the available barcode space combinatorially, generating lists with >10^15 error-correcting barcodes. The included software for creating barcode libraries and decoding sequenced barcodes is efficient and designed to be user-friendly for the general biology community.

SIGNIFICANCE STATEMENT
Modern high-throughput biological assays study pooled populations of individual members by labeling each member with a unique DNA sequence called a "barcode." DNA barcodes are frequently corrupted by DNA synthesis and sequencing errors, leading to significant data loss and incorrect data interpretation. Here, we describe a novel error-correction strategy to improve the efficiency and statistical power of DNA barcodes. To our knowledge, this is the first report of an error-correcting method that accurately handles insertions and deletions in DNA barcodes, the most common type of error encountered during DNA synthesis and sequencing, resulting in order-of-magnitude increases in accuracy, efficiency, and signal-to-noise. The accompanying software package makes deployment of these barcodes effortless for the broader experimental scientist community.
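
The design constraints and the combinatorial growth mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the FREE barcode software and does not implement the FREE code itself; the function names, the GC window of 40-60%, and the maximum homopolymer run of 2 are hypothetical values chosen only for illustration, since the abstract does not state the thresholds used.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return sum(base in "GC" for base in seq) / len(seq)


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of one repeated base in seq."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def passes_filters(seq: str, gc_min: float = 0.4, gc_max: float = 0.6,
                   max_run: int = 2) -> bool:
    """Check a candidate barcode against illustrative design filters.

    The default thresholds are assumptions, not values from the paper.
    Hairpin propensity is not checked here.
    """
    return (gc_min <= gc_fraction(seq) <= gc_max
            and longest_homopolymer(seq) <= max_run)


def concatenated_space(list_size: int, n_codes: int) -> int:
    """Number of barcodes obtained by concatenating n_codes barcodes,
    each drawn independently from a list of list_size barcodes."""
    return list_size ** n_codes
```

For instance, under these assumed thresholds `passes_filters("ACGTACGT")` accepts the candidate, while `passes_filters("AAAACGCG")` rejects it because of the four-base A run. `concatenated_space(1000, 2)` gives 1,000,000, which shows how concatenation multiplies the available barcode space.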
