Combinatorial Pattern Matching

There is a pressing need to align the growing set of expressed sequence tags (ESTs) to the newly sequenced human genome, which is still frequently revised, in order to provide biologists and medical scientists with fresh information. The problem is, however, complicated by the exon/intron structure of eukaryotic genes, misread nucleotides in ESTs, and millions of repetitive sequences in genomic sequences. To solve it, algorithms based on dynamic programming (DP) have been proposed, with space complexity O(N) and time complexity O(MN) for a genomic sequence of length M and an EST of length N, but in practice these algorithms require an enormous amount of processing time. In an effort to improve the computational efficiency of these classical DP algorithms, we develop software that fully utilizes a lookup table storing the positions at which each short subsequence occurs in the genomic sequence. The table allows efficient detection of the start and end points of an EST within a given DNA sequence and, subsequently, prompt identification of exons and introns. In addition, high sensitivity and accuracy must be achieved by calculating the locations of all splice sites correctly for more ESTs while retaining high computational efficiency. This goal is hard to accomplish in practice, owing to misread nucleotides in ESTs and repetitive sequences in the genome, but we present a couple of heuristics that are effective in settling this issue. Experimental results have confirmed that our technique improves the overall computation time by orders of magnitude compared with common tools such as sim4 and BLAT, and at the same time attains high sensitivity and accuracy on datasets of clean and documented genes. Consequently, our software is able to align about three million ESTs to a draft genome in less than one day, and all the information is available through the WWW at http://grl.gi.k.u-tokyo.ac.jp/.

A. Apostolico and M. Takeda (Eds.): CPM 2002, LNCS 2373, pp. 1–16, 2002.
© Springer-Verlag Berlin Heidelberg 2002. Jun Ogasawara and Shinichi Morishita
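
The lookup table mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' software: the function names, the toy sequences, and the choice k = 4 are assumptions made only for illustration, and the heuristics for misread nucleotides and repetitive sequences are not modeled. The table maps every length-k substring of the genomic sequence to the positions where it occurs, so the first and last k-mers of an EST yield candidate start and end points without scanning the whole genome.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_lookup_table(genome: str, k: int) -> Dict[str, List[int]]:
    """Map each length-k substring of the genome to its sorted start positions."""
    table: Dict[str, List[int]] = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    return dict(table)


def endpoint_candidates(
    table: Dict[str, List[int]], est: str, k: int
) -> Tuple[List[int], List[int]]:
    """Candidate start points (from the EST's first k-mer) and
    candidate end points, exclusive (from the EST's last k-mer)."""
    starts = list(table.get(est[:k], []))
    ends = [pos + k for pos in table.get(est[-k:], [])]
    return starts, ends


def seed_hits(
    table: Dict[str, List[int]], est: str, k: int
) -> List[Tuple[int, int]]:
    """All (EST offset, genome position) pairs where a k-mer of the EST
    occurs exactly in the genome."""
    hits = []
    for j in range(len(est) - k + 1):
        for pos in table.get(est[j:j + k], []):
            hits.append((j, pos))
    return hits


# Toy example (hypothetical data): two exons separated by an intron of G's.
genome = "AA" + "ACGT" + "GGGGGG" + "TTCA" + "AA"
est = "ACGT" + "TTCA"  # the spliced transcript
table = build_lookup_table(genome, k=4)
starts, ends = endpoint_candidates(table, est, k=4)
print(starts, ends)
print(seed_hits(table, est, k=4))
```

In this toy case the seeds that span the exon junction find no exact match in the genome, which is the gap a subsequent exon/intron identification step would have to explain. Each table query costs a dictionary lookup rather than a scan of the genome, in contrast to filling an O(MN) DP matrix.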
