An Indel-Resistant Error-Correcting Code for DNA-Based Information Storage

Synthetic DNA can in principle be used for the archival storage of arbitrary data. Because errors are introduced during DNA synthesis, storage, and sequencing, an error-correcting code (ECC) is necessary for error-free recovery of the data. Previous work has used ECCs that can correct substitution errors but not insertion or deletion errors (indels), relying instead on sequencing depth and multiple alignment to detect and correct indels, which is in effect an inefficient multiple-repetition code. This paper describes an ECC, termed "HEDGES", that simultaneously corrects substitutions, insertions, and deletions within a single read. Its variable code rates allow correction of up to ~10% nucleotide errors and achieve 50% or better of the estimated Shannon limit.
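To illustrate why a substitution-only ECC struggles with indels, the following minimal sketch compares a position-by-position (Hamming-style) mismatch count with the Levenshtein edit distance after a single substitution and after a single deletion. It is not the HEDGES algorithm; the function names `hamming` and `edit_distance` and the toy sequence are assumptions made purely for illustration.

```python
def hamming(a: str, b: str) -> int:
    """Position-by-position mismatches, plus any difference in length."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) by dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # match or substitute
            ))
        prev = cur
    return prev[-1]


# Hypothetical toy strand, not data from the paper.
original = "ACGTACGTACGTACGT"
substituted = original[:5] + "T" + original[6:]  # one substitution at index 5
deleted = original[:5] + original[6:]            # one deletion at index 5

print("substitution:", hamming(original, substituted), edit_distance(original, substituted))
print("deletion:    ", hamming(original, deleted), edit_distance(original, deleted))
```

A single deletion shifts every downstream base, so a code that reasons only about per-position mismatches sees many apparent errors even though only one edit occurred. This is the synchronization problem that an indel-correcting code has to handle.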
