Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique

Cited by: 14
Authors
Bhardwaj, Nitin [1 ,2 ]
Gerstein, Mark [2 ,3 ,4 ]
Lu, Hui [1 ]
Affiliations
[1] Univ Illinois, Bioinformat Program, Dept Bioengn, Chicago, IL 60607 USA
[2] Yale Univ, Program Computat Biol & Bioinformat, New Haven, CT 06520 USA
[3] Yale Univ, Dept Mol Biophys & Biochem, New Haven, CT 06520 USA
[4] Yale Univ, Dept Comp Sci, New Haven, CT 06520 USA
Source
BMC BIOINFORMATICS | 2010 / Volume 11
Funding
National Science Foundation (USA);
Keywords
MEMBRANE; BINDING;
DOI
10.1186/1471-2105-11-S1-S6
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
Background: In supervised learning, traditional approaches to building a classifier use two sets of examples with pre-defined classes along with a learning algorithm. The main limitation of this approach is that examples from both classes are required, which might be infeasible in certain cases, especially those dealing with biological data. Such is the case for membrane-binding peripheral domains, which play important roles in many biological processes, including cell signaling and membrane trafficking, by reversibly binding to membranes. For these domains, a well-defined positive set of domains known to bind membranes is available, along with a large unlabeled set of domains whose membrane-binding affinities have not been measured. This limitation can be addressed by a special class of semi-supervised machine learning called positive-unlabeled (PU) learning, which uses a positive set together with a large unlabeled set.
Methods: In this study, we implement the first application of PU-learning to a protein function prediction problem: identification of peripheral domains. PU-learning starts by iteratively identifying reliable negative (RN) examples from the unlabeled set until convergence, and then builds a classifier using the positive set and the final RN set. A data set of 232 positive cases and approximately 3750 unlabeled ones was used to construct and validate the protocol.
Results: Holdout evaluation of the protocol on a left-out positive set showed that the prediction accuracy reached up to 95% in two independent implementations.
Conclusion: These results suggest that our protocol can be used for predicting membrane-binding properties of a wide variety of modular domains. Protocols like the one presented here are particularly useful when information is available from one class only.
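The reliable-negative iteration described in the abstract can be illustrated with a minimal sketch. The abstract does not state which classifier, features, or convergence test the authors used, so everything here is an assumption for illustration: a nearest-centroid rule stands in for the learner, the function names (`train`, `predict`, `pu_learn`) are hypothetical, and the 2-D toy points are invented rather than taken from the paper's data.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def train(positives, negatives):
    """Stand-in learner (assumption): store one centroid per class."""
    return centroid(positives), centroid(negatives)


def predict(model, x):
    """True if x lies closer to the positive centroid than to the negative one."""
    pos_c, neg_c = model
    return dist(x, pos_c) < dist(x, neg_c)


def pu_learn(positives, unlabeled, max_iter=50):
    """Shrink the unlabeled set to reliable negatives (RN), then fit a final model."""
    reliable_neg = list(unlabeled)  # start by treating every unlabeled example as negative
    for _ in range(max_iter):
        model = train(positives, reliable_neg)
        # Keep only the examples the current model still calls negative.
        kept = [x for x in reliable_neg if not predict(model, x)]
        # Stop when the RN set no longer changes, or would become empty.
        if not kept or len(kept) == len(reliable_neg):
            break
        reliable_neg = kept
    return train(positives, reliable_neg), reliable_neg


# Invented toy data: known positives cluster near (3, 3).
positives = [[3.0, 3.0], [3.2, 2.8], [2.8, 3.1], [3.1, 3.2]]
hidden_positives = [[2.9, 3.0], [3.1, 2.9]]  # unlabeled, but positive-like
true_negatives = [[0.0, 0.0], [0.2, -0.1], [-0.1, 0.3],
                  [0.1, 0.1], [-0.2, -0.2], [0.3, 0.2]]
unlabeled = hidden_positives + true_negatives

model, reliable_neg = pu_learn(positives, unlabeled)
print(len(reliable_neg))            # number of reliable negatives retained
print(predict(model, [2.9, 3.0]))   # a positive-like query point
```

On this toy set the two positive-like unlabeled points are dropped in the first pass and the remaining six points form the final RN set. A real application would replace the centroid rule with a stronger classifier and evaluate on held-out positives, as the abstract describes.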
Pages: 8
Related papers
50 in total
  • [1] Genome-wide sequence-based prediction of peripheral proteins using a novel semi-supervised learning technique
    Nitin Bhardwaj
    Mark Gerstein
    Hui Lu
    [J]. BMC Bioinformatics, 11
  • [2] Sequence-based prediction of single nucleosome positioning and genome-wide nucleosome occupancy
    van der Heijden, Thijn
    van Vugt, Joke J. F. A.
    Logie, Colin
    van Noort, John
    [J]. PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2012, 109 (38) : E2514 - E2522
  • [3] DeepHeart: Semi-Supervised Sequence Learning for Cardiovascular Risk Prediction
    Ballinger, Brandon
    Hsieh, Johnson
    Singh, Avesh
    Sohoni, Nimit
    Wang, Jack
    Tison, Geoffrey H.
    Marcus, Gregory M.
    Sanchez, Jose M.
    Maguire, Carol
    Olgin, Jeffrey E.
    Pletcher, Mark J.
    [J]. THIRTY-SECOND AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTIETH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / EIGHTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2018, : 2079 - 2086
  • [4] Asymmetrical semi-supervised learning and prediction of disulfide connectivity in proteins
    Laboratoire d'Informatique Fondamentale , UMR CNRS 6166, Université de Provence
    [J]. Rev Intell Artif, 2006, 6 (673-695):
  • [5] Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences
    Mourad, Raphael
    [J]. BMC BIOINFORMATICS, 2023, 24 (01)
  • [6] Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences
    Raphaël Mourad
    [J]. BMC Bioinformatics, 24
  • [7] Increasing the accuracy of single sequence prediction methods using a deep semi-supervised learning framework
    Moffat, Lewis
    Jones, David T.
    [J]. BIOINFORMATICS, 2021, 37 (21) : 3744 - 3751
  • [8] Genome-wide prediction of prokaryotic two-component system networks using a sequence-based meta-predictor
    Altan Kara
    Martin Vickers
    Martin Swain
    David E. Whitworth
    Narcis Fernandez-Fuentes
    [J]. BMC Bioinformatics, 16
  • [9] Genome-wide prediction of prokaryotic two-component system networks using a sequence-based meta-predictor
    Kara, Altan
    Vickers, Martin
    Swain, Martin
    Whitworth, David E.
    Fernandez-Fuentes, Narcis
    [J]. BMC BIOINFORMATICS, 2015, 16
  • [10] Protein Function Prediction Based on Active Semi-supervised Learning
    WANG Xuesong
    CHENG Yuhu
    LI Lijing
    [J]. Chinese Journal of Electronics, 2016, 25 (04) : 595 - 600