A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions

Cited by: 1
Authors
Abnousi, Armen [1]
Broschat, Shira L. [1,2,3]
Kalyanaraman, Ananth [1,2]
Affiliations
[1] Washington State Univ, Sch EECS, Pullman, WA 99164 USA
[2] Washington State Univ, Paul G Allen Sch Global Anim Hlth, Pullman, WA 99164 USA
[3] Washington State Univ, Dept Vet Microbiol & Pathol, Pullman, WA 99164 USA
Source
PLOS ONE | 2016 / Volume 11 / Issue 08
Funding
U.S. National Science Foundation
Keywords
DOMAIN PREDICTION; IDENTIFICATION; DATABASE; TOOL;
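DOI
10.1371/journal.pone.0161338
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Discipline classification codes
07; 0710; 09
Abstract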
Background: Identifying conserved regions in protein sequences is a fundamental operation that occurs in numerous sequence-driven analysis pipelines. It is used to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains. Because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is the identification of such conserved regions. However, identifying conserved regions in large collections (millions) of protein sequences presents significant challenges.
Methods: In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact-matching short subsequences (k-mers) to quickly detect conserved regions, and uses machine learning to improve prediction accuracy. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable.
Results: We have compared NADDA with the Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set, an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground truth. On average, NADDA shows comparable accuracy and more balanced sensitivity and specificity and, being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49 s, 10,566 s, and 456 s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprising approximately 2500 sequences.
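The abstract's central idea, that exact-matching k-mers shared across many sequences hint at conserved regions, can be illustrated with a minimal Python sketch. This is not the NADDA implementation: the function names, the toy sequences, and the choice of k = 4 are assumptions made for illustration only, and the machine-learning and MapReduce stages described in the abstract are omitted.

```python
from collections import Counter


def kmer_document_counts(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # Using a set means a k-mer repeated within one sequence counts once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def position_profile(seq, counts, k):
    """For each start position in seq, how many sequences share that k-mer."""
    return [counts[seq[i:i + k]] for i in range(len(seq) - k + 1)]


# Hypothetical toy data: three sequences share the segment "TAYIAK".
sequences = ["MKTAYIAKQR", "GGTAYIAKWW", "PLTAYIAKSS", "MNNNNNNNNN"]
k = 4  # assumed k-mer length, chosen only for this example

counts = kmer_document_counts(sequences, k)
profile = position_profile(sequences[0], counts, k)
print(profile)  # [1, 1, 3, 3, 3, 1, 1]
```

In this sketch, positions whose k-mers occur in many sequences stand out as a plateau in the profile (here the three positions covering "TAYIAK"). A per-position signal of this kind is the sort of input a classifier could take, though the features actually used by NADDA are not specified in this record.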
Pages: 19