Fast motif search in protein sequence databases

Cited by: 0

Authors:

Zheleva, Elena [1]

Arslan, Abdullah N.

Affiliations:

[1] Univ Maryland, Dept Comp Sci, College Pk, MD 20742 USA

[2] Univ Vermont, Dept Comp Sci, Burlington, VT 05405 USA

Source:

COMPUTER SCIENCE - THEORY AND APPLICATIONS | 2006 / Vol. 3967

Keywords:

regular expression matching; motif search; suffix tree; PROSITE pattern; heuristic; preprocessing;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Regular expression pattern matching is widely used in computational biology. Searching through a database of sequences for a motif (a simple regular expression) or its variations is an important interactive process which requires fast motif-matching algorithms. In this paper, we explore and evaluate various representations of the database of sequences using suffix trees for two types of query problems for a given regular expression: 1) Find the first match, and 2) Find all matches. Answering Problem 1 increases the level and effectiveness of interactive motif exploration. We propose a framework in which Problem 1 can be solved in a faster manner than existing solutions while not slowing down the solution of Problem 2. We apply several heuristics both at the level of suffix tree creation, resulting in modified tree representations, and at the regular expression matching level, in which we search subtrees in a given predefined order by simulating a deterministic finite automaton that we create from the given regular expression. The focus of our work is to develop a method for faster retrieval of PROSITE motif (a restricted regular expression) matches from a protein sequence database. We show empirically the effectiveness of our solution using several real protein data sets.
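
The general idea in the abstract, walking an index of suffixes while tracking an automaton state, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a naive uncompressed suffix trie in place of the paper's modified suffix trees, and it supports only fixed-length PROSITE elements (`x`, a single residue, `[...]`, `{...}`, and exact repeats such as `x(2)`), so the automaton reduces to a chain of states in which the trie depth is the state. The names `parse_prosite`, `build_suffix_trie`, and `search` are hypothetical.

```python
import re

# One PROSITE element: x, a residue, [allowed], or {forbidden}, with an optional (n) repeat.
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\))?")

def parse_prosite(pattern):
    """Expand a restricted PROSITE pattern into one residue test per position."""
    positions = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        body, count = m.group(1), int(m.group(2) or 1)
        if body == "x":
            test = lambda c: True                               # wildcard
        elif body[0] == "[":
            test = lambda c, s=frozenset(body[1:-1]): c in s    # allowed set
        elif body[0] == "{":
            test = lambda c, s=frozenset(body[1:-1]): c not in s  # forbidden set
        else:
            test = lambda c, s=body: c == s                     # literal residue
        positions.extend([test] * count)
    return positions

def build_suffix_trie(sequences):
    """Naive suffix trie; each node records every (sequence, start) passing through it."""
    root = {"children": {}, "hits": []}
    for seq_id, seq in enumerate(sequences):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "hits": []})
                node["hits"].append((seq_id, start))
    return root

def search(root, positions, first_only=False):
    """Depth-first walk of the trie; the depth doubles as the automaton state."""
    matches = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        if depth == len(positions):          # accepting state reached
            if first_only:
                return node["hits"][:1]      # Problem 1: stop at the first match
            matches.extend(node["hits"])     # Problem 2: collect every occurrence
            continue
        for ch, child in node["children"].items():
            if positions[depth](ch):
                stack.append((child, depth + 1))
    return sorted(matches)

if __name__ == "__main__":
    trie = build_suffix_trie(["MKCAADHG", "CGGEHCC"])
    motif = parse_prosite("C-x(2)-[DE]-H")
    print(search(trie, motif))                   # all matches as (sequence, start) pairs
    print(search(trie, motif, first_only=True))  # a single match, found without a full scan
```

With `first_only=True` the walk returns as soon as one accepting node is reached, which is why the order in which subtrees are visited matters for the first-match problem; this sketch simply uses insertion order and does not reproduce the ordering heuristics the abstract refers to.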

Pages: 670 - 681

Number of pages: 12