Annotation of gene promoters by integrative data-mining of ChIP-seq Pol-II enrichment data

Cited by: 18
Authors
Gupta, Ravi [1]
Wikramasinghe, Priyankara [1]
Bhattacharyya, Anirban [1]
Perez, Francisco A. [1]
Pal, Sharmistha [1]
Davuluri, Ramana V. [1, 2]
Affiliations
[1] Wistar Inst Anat & Biol, Mol & Cellular Oncogenesis Program, Ctr Syst & Computat Biol, Philadelphia, PA USA
[2] Univ Penn, Dept Genet, Grad Grp Genom & Computat Biol, Philadelphia, PA 19104 USA
Source
BMC BIOINFORMATICS | 2010 / Vol. 11
Keywords
DEPENDENT DNA-STRUCTURE; STABILITY; REVEALS; PLURIPOTENT; PARAMETERS; LOCATION; RESOURCE; MAPS;
DOI
10.1186/1471-2105-11-S1-S65
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The use of alternative gene promoters to drive widespread cell-type, tissue-type or developmental gene regulation is common in mammalian genomes. Chromatin immunoprecipitation coupled with DNA microarrays (ChIP-chip) or massively parallel sequencing (ChIP-seq) enables genome-wide identification of active promoters under different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near gene promoters but also inside genes and in other genomic regions, owing to the non-specificity of the antibodies used in ChIP. Further, their use is limited by high cost and a strong dependence on cell type and context.
Methods: We trained and tested different state-of-the-art ensemble and meta classification methods for identifying Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a benchmark dataset, using a set of 39 feature variables based on chromatin modification signatures and various DNA sequence features. The best-performing model was applied to seven published ChIP-seq Pol-II datasets to provide genome-wide annotation of mouse gene promoters.
Results: We present a novel algorithm based on supervised learning methods to discriminate promoter-associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models to build the best predictor and found that Bagging and Random Forest based approaches give the best accuracy. We applied the algorithm to seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 protein-coding and 4,747 non-coding genes with single promoters, 9,929 protein-coding and 1,858 non-coding genes with two or more alternative promoters, and a significant number of unassigned novel promoters.
Conclusion: Our new algorithm can successfully predict promoters from the genome-wide profile of Pol-II bound regions. In addition, it performs significantly better than existing promoter prediction methods and can be applied for genome-wide prediction of Pol-II promoters.
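The abstract names Bagging as one of the best-performing approaches but gives no implementation details. As a rough illustration of the general technique only, the sketch below bags decision stumps over bootstrap samples and combines them by majority vote. Everything in it is an assumption made for illustration: the two feature columns (a GC fraction and a histone-mark signal), the synthetic data, and all function names are hypothetical and stand in for the 39 features and 500 bp sequences described above. It is not the authors' code.

```python
import random
from collections import Counter


def make_toy_data(n, seed):
    """Synthetic stand-in data: label 1 = 'promoter', 0 = 'non-promoter'.

    The two columns (GC fraction, histone-mark signal) are hypothetical
    features chosen for illustration, not the paper's 39 variables.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for i in range(n):
        label = i % 2
        gc = rng.gauss(0.65 if label else 0.40, 0.05)
        mark = rng.gauss(2.0 if label else 0.5, 0.4)
        rows.append([gc, mark])
        labels.append(label)
    return rows, labels


def train_stump(rows, labels):
    """Return the (feature, threshold, polarity) with the fewest training errors.

    'polarity' is the label predicted when the feature value >= threshold.
    """
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            for polarity in (1, 0):
                errors = sum(
                    (polarity if r[f] >= threshold else 1 - polarity) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, polarity)
    return best[1:]


def stump_predict(stump, row):
    f, threshold, polarity = stump
    return polarity if row[f] >= threshold else 1 - polarity


def train_bagging(rows, labels, n_estimators=25, seed=0):
    """Fit one stump per bootstrap sample (sampling rows with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(
            train_stump([rows[i] for i in idx], [labels[i] for i in idx])
        )
    return ensemble


def bagging_predict(ensemble, row):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(stump_predict(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    train_rows, train_labels = make_toy_data(200, seed=1)
    model = train_bagging(train_rows, train_labels)
    test_rows, test_labels = make_toy_data(100, seed=2)
    correct = sum(
        bagging_predict(model, r) == y for r, y in zip(test_rows, test_labels)
    )
    print("held-out accuracy:", correct / len(test_rows))
```

An odd number of estimators avoids tied votes in the two-class case. A Random Forest differs mainly in that each tree also considers only a random subset of features at each split, which this sketch does not do.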
Pages: 9