Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model

Authors
Han, Seong Kyu [1,5]
Muto, Yoshiharu [2]
Wilson, Parker C. [3]
Humphreys, Benjamin D. [2,4]
Sampson, Matthew G. [1,5]
Chakravarti, Aravinda [6]
Lee, Dongwon [1,7]
Affiliations
[1] Boston & Harvard Med Sch, Boston Childrens Hosp, Dept Pediat, Div Nephrol, Boston, MA 02115 USA
[2] Washington Univ St Louis, Dept Med, Div Nephrol, St Louis, MO 63130 USA
[3] Washington Univ St Louis, Dept Pathol & Immunol, St Louis, MO 63130 USA
[4] Washington Univ St Louis, Dept Dev Biol, St Louis, MO 63130 USA
[5] Broad Inst & Harvard, Kidney Dis Initiat, Cambridge, MA 02142 USA
[6] New York Univ, Ctr Human Genet & Genom, Grossman Sch Med, New York, NY 10016 USA
[7] Boston Childrens Hosp, Manton Ctr Orphan Res, Boston, MA 02115 USA
Keywords
quality control; chromatin accessibility; sequence-based model; gkmQC; GENOME-WIDE ASSOCIATION; BINDING PROTEINS; DNA; VISUALIZATION; ENHANCERS; VARIANTS; ENCODE; LMX1B; CHIP;
DOI
10.1073/pnas.2212810119
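Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]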
Discipline classification code
07; 0710; 09
Abstract
Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides the quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify "high-quality" (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
Pages: 11