GENCODE: producing a reference annotation for ENCODE

Background

The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium, and refinement of the annotation based on these experimental results.

Results

The GENCODE gene features are divided into eight categories, of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Of the 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.

Conclusion

In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, reflecting the high number of alternative splice forms with unique exons that have been annotated. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.
