Fast motif search in protein sequence databases

Authors
Zheleva, Elena [1]
Arslan, Abdullah N. [2]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
Keywords
regular expression matching; motif search; suffix tree; PROSITE pattern; heuristic; preprocessing
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Regular expression pattern matching is widely used in computational biology. Searching through a database of sequences for a motif (a simple regular expression), or its variations, is an important interactive process which requires fast motif-matching algorithms. In this paper, we explore and evaluate various representations of the database of sequences using suffix trees for two types of query problems for a given regular expression: 1) find the first match, and 2) find all matches. Answering Problem 1 increases the level and effectiveness of interactive motif exploration. We propose a framework in which Problem 1 can be solved faster than with existing solutions while not slowing down the solution of Problem 2. We apply several heuristics both at the level of suffix tree creation, resulting in modified tree representations, and at the regular expression matching level, in which we search subtrees in a given predefined order by simulating a deterministic finite automaton that we create from the given regular expression. The focus of our work is to develop a method for faster retrieval of PROSITE motif (a restricted regular expression) matches from a protein sequence database. We show empirically the effectiveness of our solution using several real protein data sets.
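The following is a minimal illustrative sketch of the general idea in the abstract, not the authors' implementation. It assumes a naive, depth-capped suffix trie in place of a real suffix tree, supports only fixed-length PROSITE elements (a residue, `[..]`, `{..}`, `x`, and a fixed repeat such as `x(2)`), and visits subtrees in alphabetical order as a stand-in for the paper's predefined search order. The names `parse_prosite`, `build_suffix_trie`, and `iter_matches` are hypothetical. Because the pattern has fixed length, the automaton state is simply the current depth in the trie.

```python
import re

AMINO = frozenset("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def parse_prosite(pattern):
    """Expand a restricted PROSITE pattern into one allowed-residue set per position."""
    elements = []
    for token in pattern.rstrip(".").split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?", token)
        if m is None:
            raise ValueError(f"unsupported element: {token!r}")
        body, count = m.group(1), int(m.group(2) or 1)
        if body == "x":                      # wildcard: any residue
            allowed = AMINO
        elif body.startswith("["):           # any of the listed residues
            allowed = frozenset(body[1:-1])
        elif body.startswith("{"):           # any residue except those listed
            allowed = AMINO - frozenset(body[1:-1])
        else:                                # a single literal residue
            allowed = frozenset(body)
        elements.extend([allowed] * count)
    return elements

def build_suffix_trie(sequences, max_depth):
    """Naive suffix trie (assumption: quadratic space, fine only for small inputs).

    Each node records every (sequence index, start offset) whose suffix
    passes through it, so a node at depth d lists all occurrences of the
    d-letter string spelled by the path from the root.
    """
    root = {"children": {}, "starts": []}
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:start + max_depth]:
                node = node["children"].setdefault(ch, {"children": {}, "starts": []})
                node["starts"].append((seq_id, start))
    return root

def iter_matches(root, elements):
    """Lazily yield (sequence index, start offset) for every match.

    Depth-first walk that only descends into children accepted by the
    pattern position at the current depth. Taking just the first yielded
    item answers "find the first match"; exhausting the generator
    answers "find all matches".
    """
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == len(elements):
            yield from node["starts"]
            continue
        # Push in reverse so that children are visited in alphabetical order.
        for ch in sorted(node["children"], reverse=True):
            if ch in elements[depth]:
                stack.append((node["children"][ch], depth + 1))

if __name__ == "__main__":
    trie = build_suffix_trie(["MKCAADEPC", "CGGEAK"], max_depth=5)
    motif = parse_prosite("C-x(2)-[DE]")
    print(next(iter_matches(trie, motif), None))  # first match only
    print(sorted(iter_matches(trie, motif)))      # all matches
```

Using a generator lets both query types share one traversal: the first-match query stops as soon as a single accepting node is reached, while the all-matches query simply continues the same walk.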
Pages: 670-681
Page count: 12