WildSpan: mining structured motifs from protein sequences

Cited by: 8

Authors:

Hsu, Chen-Ming [2]

Chen, Chien-Yu [1]

Liu, Baw-Jhiune [3]

Affiliations:

[1] Natl Taiwan Univ, Dept Bioind Mechatron Engn, Taipei 106, Taiwan

[2] Ching Yun Univ, Dept Comp Sci & Informat Engn, Jhongli 320, Taiwan

[3] Yuan Ze Univ, Dept Comp Sci & Engn, Jhongli 320, Taiwan

Source:

ALGORITHMS FOR MOLECULAR BIOLOGY | 2011 / Volume 6

Keywords:

BIOLOGICAL SEQUENCES; EFFICIENT DISCOVERY; PATTERNS; IDENTIFICATION; CONSERVATION; PREDICTION; SIGNATURES; RESIDUES; DATABASE

DOI:

10.1186/1748-7188-6-6

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]

Discipline classification codes:

071010; 081704

Abstract:

Background: Automatic extraction of motifs from biological sequences is an important research problem in molecular biology. For proteins, it is desirable to discover sequence motifs containing a large number of wildcard symbols, because the residues associated with functional sites are usually widely separated in the sequence. Discovering such patterns is time-consuming because many combinations exist when long gaps (a gap consists of one or more successive wildcards) are considered. Mining algorithms often employ constraints to narrow the search space and increase efficiency. However, improper constraint models might degrade the sensitivity and specificity of the motifs discovered by computational methods. We previously proposed a new constraint model to handle large wildcard regions for discovering functional motifs of proteins. The patterns that satisfy the proposed constraint model are called W-patterns. A W-pattern is a structured motif that groups motif symbols into pattern blocks interleaved with large irregular gaps. Considering large gaps reflects the fact that functional residues do not always come from a single region of a protein sequence, and restricting motif symbols to clusters corresponds to the observation that short motifs are frequently present within protein families. To efficiently discover W-patterns for large-scale sequence annotation and function prediction, this paper first formally introduces the problem to be solved and then proposes an algorithm named WildSpan (sequential pattern mining across large wildcard regions) that incorporates several pruning strategies to substantially reduce the mining cost.

Results: WildSpan is shown to efficiently find W-patterns containing conserved residues that are far apart in sequences. We conducted experiments with two mining strategies, protein-based and family-based mining, to evaluate the usefulness of W-patterns and the performance of WildSpan. The protein-based mining mode of WildSpan is developed for discovering functional regions of a single protein by referring to a set of related sequences (e.g. its homologues). The discovered W-patterns are used to characterize the protein sequence, and the results are compared with the conserved positions identified by multiple sequence alignment (MSA). The family-based mining mode of WildSpan is developed for extracting sequence signatures for a group of related proteins (e.g. a protein family) for protein function classification. In this setting, the discovered W-patterns are compared with PROSITE patterns as well as the patterns generated by three existing methods that perform a similar task. Finally, analysis of the execution time of WildSpan reveals that the proposed pruning strategy is effective in improving the scalability of the proposed algorithm.

Conclusions: The mining results obtained in this study reveal that WildSpan is efficient and effective in discovering functional signatures of proteins directly from sequences. The proposed pruning strategy is effective in improving the scalability of WildSpan. This study demonstrates that the W-patterns discovered by WildSpan provide useful information for characterizing protein sequences. The WildSpan executable and open source code are available on the web (http://biominer.csie.cyu.edu.tw/wildspan).
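As an illustration of the W-pattern structure described in the Background (short pattern blocks interleaved with large irregular gaps), the following is a minimal Python sketch of how such a pattern could be represented and matched. It is not the WildSpan mining algorithm and does not reproduce the paper's constraint model. The block/gap representation, the function names `wpattern_to_regex` and `support`, and the toy sequences are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# Hypothetical representation (an assumption, not taken from the paper):
#   blocks: short motif strings of residue letters, where '.' stands for a
#           single-residue wildcard inside a block
#   gaps:   one (min, max) wildcard count between each pair of adjacent blocks

def wpattern_to_regex(blocks: List[str], gaps: List[Tuple[int, int]]) -> re.Pattern:
    """Compile a block/gap pattern into a regular expression."""
    if not blocks:
        raise ValueError("at least one block is required")
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between adjacent blocks")
    for block in blocks:
        # Only residue letters and the wildcard are allowed, so no other
        # regex metacharacters can slip into the compiled pattern.
        if not re.fullmatch(r"[A-Z.]+", block):
            raise ValueError(f"invalid block: {block!r}")
    parts = [blocks[0]]
    for (lo, hi), block in zip(gaps, blocks[1:]):
        if lo < 0 or hi < lo:
            raise ValueError(f"invalid gap range: {(lo, hi)}")
        parts.append(".{%d,%d}" % (lo, hi))  # a large, variable-length gap
        parts.append(block)
    return re.compile("".join(parts))

def support(pattern: re.Pattern, sequences: List[str]) -> float:
    """Fraction of sequences containing at least one occurrence of the pattern."""
    if not sequences:
        raise ValueError("sequence set must not be empty")
    hits = sum(1 for seq in sequences if pattern.search(seq))
    return hits / len(sequences)

if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    toy_sequences = [
        "MKCGH" + "A" * 12 + "DETLL",  # gap of 12 residues between the blocks
        "CAH" + "G" * 3 + "DE",        # gap of 3 residues, below the minimum
        "TTCWH" + "S" * 7 + "DEK",     # gap of 7 residues between the blocks
    ]
    pattern = wpattern_to_regex(["C.H", "DE"], [(5, 20)])
    print(pattern.pattern)
    print(support(pattern, toy_sequences))
```

Here the two blocks `C.H` and `DE` play the role of clustered motif symbols, and the `(5, 20)` range plays the role of a large irregular gap. Counting the fraction of sequences that contain the pattern is one common way to score a candidate motif against a set of related proteins; whether WildSpan uses this exact measure is not stated in the abstract.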

Pages: 16