Fast motif search in protein sequence databases

Authors
Zheleva, Elena [1]
Arslan, Abdullah N. [2]
Affiliations
[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA
[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA
Keywords
regular expression matching; motif search; suffix tree; PROSITE pattern; heuristic; preprocessing
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Regular expression pattern matching is widely used in computational biology. Searching a database of sequences for a motif (a simple regular expression), or its variations, is an important interactive process that requires fast motif-matching algorithms. In this paper, we explore and evaluate various representations of the sequence database using suffix trees for two types of query problems for a given regular expression: 1) find the first match, and 2) find all matches. Answering Problem 1 quickly increases the level and effectiveness of interactive motif exploration. We propose a framework in which Problem 1 can be solved faster than with existing solutions without slowing down the solution of Problem 2. We apply several heuristics at two levels: at suffix tree creation, which results in modified tree representations, and at regular expression matching, where we search subtrees in a predefined order by simulating a deterministic finite automaton created from the given regular expression. The focus of our work is a method for faster retrieval of PROSITE motif (a restricted regular expression) matches from a protein sequence database. We show empirically the effectiveness of our solution using several real protein data sets.
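The two query problems can be illustrated with a minimal sketch. It is not the suffix-tree and automaton method of the paper: it assumes a plain linear scan with Python's `re` module, and the helper names (`prosite_to_regex`, `find_all`, `find_first`) and the toy database are hypothetical. It only shows how a PROSITE pattern maps to a regular expression and how a first-match query can stop early while an all-matches query keeps scanning.

```python
import re

# One PROSITE element: x, a residue, [allowed], or {forbidden},
# optionally followed by a repetition such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression."""
    body = pattern.strip().rstrip(".")
    pieces = []
    for element in body.split("-"):
        start = end = ""
        if element.startswith("<"):      # anchored at the N-terminus
            start, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            end, element = "$", element[:-1]
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported PROSITE element: " + element)
        core, lo, hi = m.groups()
        if core == "x":                  # any amino acid
            core = "."
        elif core.startswith("{"):       # any amino acid except those listed
            core = "[^" + core[1:-1] + "]"
        if lo is not None:               # repetition count or range
            core += "{" + lo + ("," + hi if hi else "") + "}"
        pieces.append(start + core + end)
    return "".join(pieces)


def find_all(pattern, database):
    """Problem 2: lazily yield (sequence id, position, match) for every
    non-overlapping match in every sequence."""
    compiled = re.compile(prosite_to_regex(pattern))
    for seq_id, sequence in database.items():
        for m in compiled.finditer(sequence):
            yield (seq_id, m.start(), m.group())


def find_first(pattern, database):
    """Problem 1: return the first match, or None; scanning stops as soon
    as one match is found because find_all is a generator."""
    return next(find_all(pattern, database), None)


if __name__ == "__main__":
    # Toy database (made-up sequences, not real protein data).
    db = {"s1": "AAAA", "s2": "AANGSAANPSANKTA"}
    motif = "N-{P}-[ST]-{P}."
    print(prosite_to_regex(motif))
    print(find_first(motif, db))
    print(list(find_all(motif, db)))
```

Note that `re.finditer` reports non-overlapping matches only, and that this scan visits sequences in insertion order; an index such as a suffix tree would instead let the search skip regions that cannot match.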
Pages: 670-681
Number of pages: 12