Quality assessment and refinement of chromatin accessibility data using a sequence-based predictive model

Cited by: 3
Authors
Han, Seong Kyu [1 ,5 ]
Muto, Yoshiharu [2 ]
Wilson, Parker C. [3 ]
Humphreys, Benjamin D. [2 ,4 ]
Sampson, Matthew G. [1 ,5 ]
Chakravarti, Aravinda [6 ]
Lee, Dongwon [1 ,7 ]
Affiliations
[1] Boston & Harvard Med Sch, Boston Childrens Hosp, Dept Pediat, Div Nephrol, Boston, MA 02115 USA
[2] Washington Univ St Louis, Dept Med, Div Nephrol, St Louis, MO 63130 USA
[3] Washington Univ St Louis, Dept Pathol & Immunol, St Louis, MO 63130 USA
[4] Washington Univ St Louis, Dept Dev Biol, St Louis, MO 63130 USA
[5] Broad Inst & Harvard, Kidney Dis Initiat, Cambridge, MA 02142 USA
[6] New York Univ, Ctr Human Genet & Genom, Grossman Sch Med, New York, NY 10016 USA
[7] Boston Childrens Hosp, Manton Ctr Orphan Res, Boston, MA 02115 USA
Keywords
quality control; chromatin accessibility; sequence-based model; gkmQC; GENOME-WIDE ASSOCIATION; BINDING PROTEINS; DNA; VISUALIZATION; ENHANCERS; VARIANTS; ENCODE; LMX1B; CHIP;
DOI
10.1073/pnas.2212810119
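Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification codes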
07 ; 0710 ; 09 ;
Abstract
Chromatin accessibility assays are central to the genome-wide identification of gene regulatory elements associated with transcriptional regulation. However, the data have highly variable quality arising from several biological and technical factors. To surmount this problem, we developed a sequence-based machine learning method to evaluate and refine chromatin accessibility data. Our framework, gapped k-mer SVM quality check (gkmQC), provides quality metrics for a sample based on the prediction accuracy of the trained models. We tested 886 DNase-seq samples from the ENCODE/Roadmap projects to demonstrate that gkmQC can effectively identify "high-quality" (HQ) samples with low conventional quality scores owing to marginal read depths. Peaks identified in HQ samples are more accurately aligned at functional regulatory elements, show greater enrichment of regulatory elements harboring functional variants, and explain greater heritability of phenotypes from their relevant tissues. Moreover, gkmQC can optimize the peak-calling threshold to identify additional peaks, especially for rare cell types in single-cell chromatin accessibility data.
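As an illustration of a quality metric "based on the prediction accuracy of the trained models", the sketch below computes an area under the ROC curve (AUC) from classifier scores for peak sequences and background sequences. This is a minimal sketch under stated assumptions, not gkmQC's implementation: the abstract does not say which accuracy measure is used, so AUC is an assumption here, and the function name `auc_from_scores` and the example scores are hypothetical.

```python
from typing import Sequence


def auc_from_scores(pos: Sequence[float], neg: Sequence[float]) -> float:
    """Return the AUC: the fraction of (positive, negative) score pairs in
    which the positive scores higher, with ties counted as half.

    ASSUMPTION: `pos` holds model scores for sequences under called peaks and
    `neg` holds scores for background sequences. AUC is an assumed stand-in
    for "prediction accuracy"; it is not taken from the source.
    """
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores: three of the four (peak, background) pairs are
# ordered correctly, so the AUC is 0.75.
sample_auc = auc_from_scores([0.9, 0.3], [0.5, 0.1])
```

Under this reading, a sample whose peak sequences are easy to separate from background would score near 1.0, while a sample whose peaks are indistinguishable from background would score near 0.5.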
Pages: 11