Homology induction: the use of machine learning to improve sequence similarity searches

Cited by: 20
Authors
Karwath, A [1]
King, RD [1]
Affiliations
[1] Univ Coll Wales, Dept Comp Sci, Aberystwyth SY23 3DB, Dyfed, Wales
DOI
10.1186/1471-2105-3-11
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010; 081704
Abstract
Background: The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000). Results: We present Homology Induction (HI), a new approach to inferring homology. HI uses machine learning to bootstrap from standard sequence similarity search methods. First a standard method is run, then HI learns rules which are true for sequences of high similarity to the target (assumed homologues) and not true for general sequences; these rules are then used to discriminate sequences in the twilight zone. To learn the rules, HI describes the sequences in a novel way based on a bioinformatic knowledge base and the machine learning method of inductive logic programming. To evaluate HI we used the PDB40D benchmark, which lists sequences of known homology but low sequence similarity. We compared the HI methodology with PSI-BLAST alone and found HI performed significantly better. In addition, Receiver Operating Characteristic (ROC) curve analysis showed that these improvements were robust for all reasonable error costs. The predictive homology rules learnt by HI can be interpreted biologically to provide insight into conserved features of homologous protein families. Conclusions: HI is a new technique for the detection of remote protein homology, a central bioinformatic problem. HI with PSI-BLAST is shown to outperform PSI-BLAST for all error costs. It is expected that similar improvements would be obtained using HI with any sequence similarity method.
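The bootstrapping idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. In place of inductive logic programming it uses a simple coverage filter over single attributes, and it does not run PSI-BLAST. The `Hit` record, the attribute strings, the E-value cut-offs (`confident_e`, `twilight_e`) and the coverage thresholds (`min_pos`, `max_neg`) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """A search hit with hypothetical knowledge-base attributes."""
    name: str
    e_value: float
    attributes: frozenset


def learn_rules(positives, negatives, min_pos=0.8, max_neg=0.1):
    """Keep attributes common among positives and rare among negatives.

    Assumption: a single-attribute coverage filter stands in for the
    inductive logic programming step named in the abstract.
    """
    candidates = set().union(*(h.attributes for h in positives))
    rules = []
    for attr in sorted(candidates):
        pos_cov = sum(attr in h.attributes for h in positives) / len(positives)
        neg_cov = (
            sum(attr in h.attributes for h in negatives) / len(negatives)
            if negatives else 0.0
        )
        if pos_cov >= min_pos and neg_cov <= max_neg:
            rules.append(attr)
    return rules


def homology_induction(hits, background, confident_e=1e-3, twilight_e=10.0):
    """Learn rules from confident hits, then apply them to twilight-zone hits."""
    # Assumed homologues: hits the standard search already scores confidently.
    positives = [h for h in hits if h.e_value <= confident_e]
    # Twilight zone: hits too weak to call on similarity alone.
    twilight = [h for h in hits if confident_e < h.e_value <= twilight_e]
    rules = learn_rules(positives, background)
    predicted = [h.name for h in twilight
                 if any(r in h.attributes for r in rules)]
    return rules, predicted


# Toy data (entirely made up) standing in for search output and a sample
# of general sequences.
hits = [
    Hit("seqA", 1e-30, frozenset({"kw:kinase", "org:eukaryote", "len:medium"})),
    Hit("seqB", 1e-12, frozenset({"kw:kinase", "org:bacteria", "len:medium"})),
    Hit("seqC", 0.5, frozenset({"kw:kinase", "len:long"})),
    Hit("seqD", 2.0, frozenset({"kw:protease", "len:medium"})),
]
background = [
    Hit("bg1", float("inf"), frozenset({"kw:protease", "len:medium"})),
    Hit("bg2", float("inf"), frozenset({"kw:transport", "org:bacteria"})),
    Hit("bg3", float("inf"), frozenset({"len:long", "org:eukaryote"})),
]

rules, predicted = homology_induction(hits, background)
print(rules)      # attributes kept as discriminating rules
print(predicted)  # twilight-zone hits matched by at least one rule
```

In this toy run only the attribute shared by both confident hits and absent from the background sample survives as a rule, so the twilight-zone hit carrying that attribute is the one flagged as a candidate homologue.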
Pages: 13
Related papers
  • [1] Homology Induction: the use of machine learning to improve sequence similarity searches
    Andreas Karwath
    Ross D King
    [J]. BMC Bioinformatics, 3
  • [2] Use of Machine Learning for Improvement of Similarity Searches of Patients
    Petrovan, B.
    Orza, B.
    Vlaicu, A.
    [J]. INTERNATIONAL CONFERENCE ON ADVANCEMENTS OF MEDICINE AND HEALTH CARE THROUGH TECHNOLOGY, MEDITECH 2016, 2017, 59 : 252 - 255
  • [3] USE OF HOMOLOGY DOMAINS IN SEQUENCE SIMILARITY DETECTION
    LAWRENCE, CB
    [J]. METHODS IN ENZYMOLOGY, 1990, 183 : 133 - 146
  • [4] PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL
    Bramucci, Emanuele
    Paiardini, Alessandro
    Bossa, Francesco
    Pascarella, Stefano
    [J]. BMC BIOINFORMATICS, 2012, 13
  • [5] PyMod: sequence similarity searches, multiple sequence-structure alignments, and homology modeling within PyMOL
    Emanuele Bramucci
    Alessandro Paiardini
    Francesco Bossa
    Stefano Pascarella
    [J]. BMC Bioinformatics, 13
  • [6] Faster sequence homology searches by clustering subsequences
    Suzuki, Shuji
    Kakuta, Masanori
    Ishida, Takashi
    Akiyama, Yutaka
    [J]. BIOINFORMATICS, 2015, 31 (08) : 1183 - 1190
  • [7] Detecting false positive sequence homology: a machine learning approach
    M. Stanley Fujimoto
    Anton Suvorov
    Nicholas O. Jensen
    Mark J. Clement
    Seth M. Bybee
    [J]. BMC Bioinformatics, 17
  • [8] Detecting false positive sequence homology: a machine learning approach
    Fujimoto, M. Stanley
    Suvorov, Anton
    Jensen, Nicholas O.
    Clement, Mark J.
    Bybee, Seth M.
    [J]. BMC BIOINFORMATICS, 2016, 17
  • [9] Boosting descriptors for similarity searches: Feature trees trained by machine learning.
    Gastreich, M
    Liao, J
    Hessler, G
    Pfeiffer-Marek, S
    Hindle, SA
    Warmuth, M
    Lemmen, C
    Naumann, T
    Baringhaus, KH
    [J]. ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2005, 229 : U609 - U609
  • [10] Empirical statistical estimates for sequence similarity searches
    Pearson, WR
    [J]. JOURNAL OF MOLECULAR BIOLOGY, 1998, 276 (01) : 71 - 84