Fast motif search in protein sequence databases

Cited by: 0
Authors
Zheleva, Elena [1]
Arslan, Abdullah N. [2]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
Keywords
regular expression matching; motif search; suffix tree; PROSITE pattern; heuristic; preprocessing
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Regular expression pattern matching is widely used in computational biology. Searching a database of sequences for a motif (a simple regular expression), or for its variations, is an important interactive process that requires fast motif-matching algorithms. In this paper, we explore and evaluate various representations of the sequence database using suffix trees for two types of query problems for a given regular expression: 1) find the first match, and 2) find all matches. Answering Problem 1 increases the level and effectiveness of interactive motif exploration. We propose a framework in which Problem 1 can be solved faster than with existing solutions without slowing down the solution of Problem 2. We apply several heuristics at two levels: at suffix tree creation, which results in modified tree representations, and at regular expression matching, where we search subtrees in a predefined order by simulating a deterministic finite automaton that we create from the given regular expression. The focus of our work is a method for faster retrieval of PROSITE motif (a restricted regular expression) matches from a protein sequence database. We show empirically the effectiveness of our solution using several real protein data sets.
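The two query problems in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several simplifying assumptions: it uses an uncompressed, depth-limited suffix trie rather than a suffix tree, it handles only fixed-length PROSITE patterns (so the automaton reduces to a chain of states, one per position), and it applies none of the paper's heuristics. All names (`parse_prosite`, `build_suffix_trie`, `find_first`, `find_all`) are hypothetical.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

ANY = None  # stands for the PROSITE wildcard 'x' (any residue)
Position = Optional[Set[str]]
Hit = Tuple[int, int]  # (sequence index, start offset)


def parse_prosite(pattern: str) -> List[Position]:
    """Turn a restricted PROSITE pattern into one allowed-residue set per position.

    Assumption: only single residues (C), classes ([DE]), wildcards (x) and
    fixed repeats such as x(2) are supported; ranges like x(2,4) are not.
    """
    positions: List[Position] = []
    for element in pattern.rstrip(".").split("-"):
        count = 1
        if "(" in element:
            element, repeat = element[:-1].split("(")
            count = int(repeat)
        if element == "x":
            allowed: Position = ANY
        elif element.startswith("["):
            allowed = set(element[1:-1])
        else:
            allowed = {element}
        positions.extend([allowed] * count)
    return positions


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.starts: List[Hit] = []  # every suffix that passes through this node


def build_suffix_trie(sequences: List[str], max_depth: int) -> Node:
    """Index every suffix of every sequence, truncated to max_depth characters."""
    root = Node()
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_depth]:
                node = node.children.setdefault(ch, Node())
                node.starts.append((seq_id, start))
    return root


def matches(root: Node, positions: List[Position]) -> Iterator[Hit]:
    """Walk the trie depth-first, following only edges the pattern allows.

    Assumption: len(positions) <= max_depth used to build the trie;
    longer patterns simply yield nothing.
    """
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == len(positions):
            yield from node.starts  # all suffixes below this node match
            continue
        allowed = positions[depth]
        for ch in sorted(node.children, reverse=True):
            if allowed is ANY or ch in allowed:
                stack.append((node.children[ch], depth + 1))


def find_first(root: Node, positions: List[Position]) -> Optional[Hit]:
    """Problem 1: stop at the first match found in traversal order."""
    return next(matches(root, positions), None)


def find_all(root: Node, positions: List[Position]) -> List[Hit]:
    """Problem 2: collect every match."""
    return sorted(matches(root, positions))


sequences = ["MKCAADE", "CGGEC"]
pattern = parse_prosite("C-x(2)-[DE]")
trie = build_suffix_trie(sequences, max_depth=len(pattern))
first_hit = find_first(trie, pattern)
all_hits = find_all(trie, pattern)
```

Because `matches` is a generator, `find_first` stops as soon as one accepting node is reached, while `find_all` exhausts the same traversal. In this sketch "first" means first in traversal order, which is not necessarily the leftmost occurrence in the database.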
Pages: 670-681
Page count: 12