Feature Selection Approach for Quantitative Prediction of Transcriptional Activities

Protein-DNA interactions play a crucial role in transcriptional regulation and other biological processes. Quantitative predictive models of protein-DNA binding affinity can improve our understanding of molecular interactions and help validate putative transcription factor binding sites and other regulatory features. Such models must take into account context-specific features of both the DNA and the protein. Because these features are highly complex, we consider here only the contextual features of DNA associated with binding affinity. Two types of features are considered in this paper: 1) conformational and physico-chemical properties of the nucleotide sequence, and 2) evolutionary conservation information in the form of position-specific weight matrices. We present a feature selection approach named leave-one-out sequential forward selection (LOOSFS). The method uses the leave-one-out cross-validation error of least squares support vector machines (LS-SVM) to estimate the test error of the quantitative prediction model. It is used to identify important features possibly responsible for differences in the transcriptional activities of 130 DNA sequences, which were obtained by single-base substitutions within the promoter of the mouse beta-major globin gene. The selected features and predicted activity values correlate well with experimental results.
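The selection loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the RBF kernel, the hyperparameters `gamma` and `sigma`, the stopping rule (stop when the leave-one-out error no longer decreases), and all function names are assumptions made for the example. The leave-one-out error is computed by explicitly refitting the LS-SVM regressor with each sample held out, which is simple but not the fastest option.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian kernel between rows of A and rows of B (kernel choice is an assumption).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    # LS-SVM regression: solve the linear system
    # [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y].
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0
    M[1:, 0] = 1.0
    M[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    sol = np.linalg.solve(M, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

def loo_mse(X, y, gamma=10.0, sigma=1.0):
    # Explicit leave-one-out: refit without sample i, then predict sample i.
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, alpha = lssvm_fit(X[keep], y[keep], gamma, sigma)
        pred = lssvm_predict(X[keep], b, alpha, X[i:i + 1], sigma)
        errs[i] = y[i] - pred[0]
    return float(np.mean(errs ** 2))

def loosfs(X, y, max_features=None, gamma=10.0, sigma=1.0):
    # Sequential forward selection: greedily add the feature that gives the
    # lowest leave-one-out error; stop when no candidate improves it.
    n_features = X.shape[1]
    max_features = max_features or n_features
    selected, best_err = [], np.inf
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scores = {j: loo_mse(X[:, selected + [j]], y, gamma, sigma)
                  for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_err:
            break
        selected.append(j_best)
        best_err = scores[j_best]
    return selected, best_err
```

Calling `loosfs(X, y)` on a feature matrix `X` (one row per sequence, one column per DNA feature) and an activity vector `y` returns the indices of the chosen features in the order they were added, together with the final leave-one-out mean squared error. In practice the features would typically be standardized first, and `gamma` and `sigma` would need tuning.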
