Biological sequence classification utilizing positive and unlabeled data

Cited by: 7
Authors
Xiao, Yuanyuan [1]
Segal, Mark R. [1]
Affiliation
[1] Univ Calif San Francisco, Dept Epidemiol & Biostat, Ctr Bioinformat & Mol Biostat, San Francisco, CA 94107 USA
DOI
10.1093/bioinformatics/btn089
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.
Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem, and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
Contact: mark@biostat.ucsf.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
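The abstract names two ingredients, an iterative classification scheme and a class dispersion measure used for monitoring, without giving algorithmic detail. The following is only a generic sketch of that idea and is not the authors' LP-IC procedure. It assumes sequences have already been encoded as numeric feature vectors, swaps in a nearest-centroid rule as a stand-in classifier, and uses the within-class sum of squared distances as one possible dispersion measure. The names `iterative_pu`, `dispersion` and `centroid` are hypothetical.

```python
from math import dist


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def dispersion(groups):
    """Assumed dispersion measure: total within-class sum of squared distances."""
    total = 0.0
    for pts in groups:
        if pts:
            c = centroid(pts)
            total += sum(dist(p, c) ** 2 for p in pts)
    return total


def iterative_pu(positives, unlabeled, max_iter=20):
    """Iteratively flag unlabeled points as likely positives.

    Returns (flags, history): flags[i] is True when unlabeled[i] ends up
    assigned to the positive class; history holds the dispersion recorded
    after each reassignment round.
    """
    # Round 0: treat every unlabeled point as a provisional negative.
    flags = [False] * len(unlabeled)
    history = []
    for _ in range(max_iter):
        pos = positives + [u for u, f in zip(unlabeled, flags) if f]
        neg = [u for u, f in zip(unlabeled, flags) if not f]
        if not neg:
            break  # nothing left to contrast against
        c_pos, c_neg = centroid(pos), centroid(neg)
        # Stand-in classifier: assign each unlabeled point to the nearer centroid.
        new_flags = [dist(u, c_pos) < dist(u, c_neg) for u in unlabeled]
        new_pos = positives + [u for u, f in zip(unlabeled, new_flags) if f]
        new_neg = [u for u, f in zip(unlabeled, new_flags) if not f]
        history.append(dispersion([new_pos, new_neg]))
        if new_flags == flags:
            break  # assignments are stable
        flags = new_flags
    return flags, history
```

Calling `iterative_pu(known_positives, unlabeled_vectors)` on toy data returns the per-point flags together with the dispersion trace. In a fuller treatment the trace could be inspected to choose which iteration's model to keep, which is the monitoring role the abstract attributes to the dispersion measure.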
Pages: 1198-1205
Page count: 8