Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model

Authors
Han, Seong Kyu [1, 5]
Muto, Yoshiharu [2]
Wilson, Parker C. [3]
Humphreys, Benjamin D. [2, 4]
Sampson, Matthew G. [1, 5]
Chakravarti, Aravinda [6 ]
Lee, Dongwon [1, 7]
Affiliations
[1] Boston & Harvard Med Sch, Boston Childrens Hosp, Dept Pediat, Div Nephrol, Boston, MA 02115 USA
[2] Washington Univ St Louis, Dept Med, Div Nephrol, St Louis, MO 63130 USA
[3] Washington Univ St Louis, Dept Pathol & Immunol, St Louis, MO 63130 USA
[4] Washington Univ St Louis, Dept Dev Biol, St Louis, MO 63130 USA
[5] Broad Inst & Harvard, Kidney Dis Initiat, Cambridge, MA 02142 USA
[6] New York Univ, Ctr Human Genet & Genom, Grossman Sch Med, New York, NY 10016 USA
[7] Boston Childrens Hosp, Manton Ctr Orphan Res, Boston, MA 02115 USA
Keywords
quality control; chromatin accessibility; sequence-based model; gkmQC; GENOME-WIDE ASSOCIATION; BINDING PROTEINS; DNA; VISUALIZATION; ENHANCERS; VARIANTS; ENCODE; LMX1B; CHIP
DOI
10.1073/pnas.2212810119
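Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09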
Abstract
Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify "high-quality" (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
Pages: 11