Application of neural networks and information theory to the identification of E. coli transcriptional promoters

The Human Genome Project has as its eventual goal the determination of the entire human DNA sequence, which comprises approximately 3 billion base pairs. An important aspect of this project will be the analysis of the sequence to locate regions of biological importance. New computer methods will be needed to automate and facilitate this task. In this paper, we investigate the use of neural networks for the recognition of functional patterns in biological sequences. The prediction of Escherichia coli transcriptional promoters was chosen as a model system for these studies. Two approaches were employed. In the first method, a mutual information analysis of promoter and nonpromoter sequences was carried out to identify the informative base positions that help to distinguish promoter sequences from nonpromoter sequences. These base positions were then used to train a Perceptron to predict new promoter sequences. In the second method, the experimental knowledge of promoters was used to indicate the important base positions in the sequence. These base positions were used to train a back propagation network with hidden units which represented regions of sequence conservation found in promoters. With both types of network, prediction accuracy for new promoter sequences was greater than 96.9%. 12 refs., 1 fig., 4 tabs.
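
The first method described above can be illustrated with a minimal sketch. It is not the authors' implementation: the eight toy sequences, the choice of two selected positions, the one-hot encoding, and the function names (`mutual_information`, `one_hot`, `train_perceptron`, `predict`) are all assumptions made for illustration. The sketch scores each base position by its mutual information with the promoter/nonpromoter label, keeps the highest-scoring positions, and trains a simple perceptron on those positions only.

```python
import math
from collections import Counter

BASES = "ACGT"

def mutual_information(seqs, labels, pos):
    """Mutual information (in bits) between the base at `pos` and the class label."""
    n = len(seqs)
    joint = Counter((s[pos], y) for s, y in zip(seqs, labels))
    base_counts = Counter(s[pos] for s in seqs)
    label_counts = Counter(labels)
    mi = 0.0
    for (base, label), count in joint.items():
        p_joint = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        ratio = count * n / (base_counts[base] * label_counts[label])
        mi += p_joint * math.log2(ratio)
    return mi

def one_hot(seq, positions):
    """Encode the bases at the selected positions as a flat 0/1 vector."""
    vec = []
    for p in positions:
        vec.extend(1.0 if seq[p] == b else 0.0 for b in BASES)
    return vec

def predict(w, bias, x):
    """Threshold unit: 1 (promoter) if the weighted sum is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else 0

def train_perceptron(X, y, epochs=50):
    """Classic perceptron learning rule; stops early once an epoch has no errors."""
    w, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X, y):
            delta = target - predict(w, bias, x)
            if delta != 0:
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                bias += delta
                errors += 1
        if errors == 0:
            break
    return w, bias

# Invented toy data (assumption): "promoters" share T at index 2 and A at index 3.
promoters = ["ACTAGC", "GGTACA", "CATATT", "TCTAAG"]
nonpromoters = ["ACGCGC", "GGCGCA", "CAGGTT", "TCCCAG"]
seqs = promoters + nonpromoters
labels = [1] * len(promoters) + [0] * len(nonpromoters)

# Rank positions by mutual information and keep the two most informative ones.
mi_scores = [mutual_information(seqs, labels, p) for p in range(len(seqs[0]))]
top_positions = sorted(sorted(range(len(mi_scores)), key=lambda p: -mi_scores[p])[:2])

# Train the perceptron on the selected positions only.
X = [one_hot(s, top_positions) for s in seqs]
w, bias = train_perceptron(X, labels)

print("informative positions:", top_positions)
print("prediction for GATACC:", predict(w, bias, one_hot("GATACC", top_positions)))
```

On real data the number of retained positions and the training schedule would have to be chosen from the mutual information profile and a held-out test set; the toy data here is separable by construction, so it says nothing about the accuracy reported in the abstract.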