Annotation guided local similarity search in multiple sequences and its application to mitochondrial genomes

Given a set of nucleotide sequences and corresponding gene annotations, which may contain a moderate number of errors, we consider the problem of identifying common substrings that occur in homologous genes and of identifying putative errors in the given annotations. The problem is solved by identifying nodes in a suffix tree that contains all substrings occurring in the data set. Because the targeted data set is large, our approach employs a truncated version of suffix trees. The approach is successfully applied to the mitochondrial nucleotide sequences and corresponding annotations available in RefSeq for more than 2000 metazoan species. We demonstrate that the approach finds appropriate subsequences despite errors in the given annotations. Moreover, it identifies several hundred errors within the RefSeq annotations.
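The idea of locating annotation-consistent nodes in a truncated suffix tree can be illustrated with a minimal sketch. It is not the authors' implementation: it uses an uncompressed trie cut off at depth k instead of a true truncated suffix tree, inspects only the depth-k leaves, and applies a simple majority vote over annotation labels. All names (`gene_at`, `build_truncated_trie`, `suspicious_annotations`), the parameters `k` and `min_species`, and the toy sequences and gene labels are assumptions made for illustration.

```python
from collections import defaultdict


def gene_at(features, start, length):
    """Return the gene whose half-open interval [f_start, f_end) fully
    contains the substring starting at `start`, or None if there is none."""
    for f_start, f_end, gene in features:
        if f_start <= start and start + length <= f_end:
            return gene
    return None


def build_truncated_trie(seqs, annots, k):
    """Insert the length-k prefix of every suffix into a nested-dict trie.
    Each depth-k leaf records (species, position, annotated gene) hits."""
    root = {}
    for sid, seq in seqs.items():
        for i in range(len(seq) - k + 1):
            node = root
            for ch in seq[i:i + k]:
                node = node.setdefault(ch, {})
            hit = (sid, i, gene_at(annots[sid], i, k))
            node.setdefault("$hits", []).append(hit)
    return root


def leaves(node, prefix=""):
    """Yield (substring, hits) for every depth-k leaf of the trie."""
    for key, child in node.items():
        if key == "$hits":
            yield prefix, child
        else:
            yield from leaves(child, prefix + key)


def suspicious_annotations(seqs, annots, k, min_species):
    """Map each widely shared substring to its majority gene label and the
    hits whose annotation disagrees with that label (putative errors)."""
    report = {}
    for kmer, hits in leaves(build_truncated_trie(seqs, annots, k)):
        species_by_gene = defaultdict(set)
        for sid, _, gene in hits:
            species_by_gene[gene].add(sid)
        n_species = len({sid for sid, _, _ in hits})
        if n_species < min_species:
            continue  # substring is not shared widely enough
        gene, supporters = max(species_by_gene.items(),
                               key=lambda kv: len(kv[1]))
        if gene is None or 2 * len(supporters) <= n_species:
            continue  # no strict majority of species agrees on a gene
        report[kmer] = (gene, [h for h in hits if h[2] != gene])
    return report


# Toy data (hypothetical): three sequences share "ACGTACG", but the third
# one carries a deviating gene label for that region.
seqs = {"sp1": "AACGTACGGTT", "sp2": "TTACGTACGAA", "sp3": "GGACGTACGCC"}
annots = {
    "sp1": [(0, 11, "cox1")],
    "sp2": [(0, 11, "cox1")],
    "sp3": [(0, 11, "nad1")],
}
report = suspicious_annotations(seqs, annots, k=7, min_species=3)
print(report)
```

In this toy run the only substring present in all three sequences is `ACGTACG`. Two of the three species annotate it as `cox1`, so the `nad1` label in `sp3` is reported as a putative annotation error. Truncating at depth k bounds memory by the number of distinct length-k substrings rather than by total sequence length, which is the motivation the abstract gives for using a truncated structure on a large data set.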
