Biological sequence classification utilizing positive and unlabeled data

Cited by: 7
Authors
Xiao, Yuanyuan [1]
Segal, Mark R. [1]
Affiliations
[1] Univ Calif San Francisco, Dept Epidemiol & Biostat, Ctr Bioinformat & Mol Biostat, San Francisco, CA 94107 USA
DOI
10.1093/bioinformatics/btn089
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification codes
071010; 081704;
Abstract
Motivation: In the genomics setting, an increasingly common data configuration consists of a small set of sequences possessing a targeted property (positive instances) amongst a large set of sequences for which class membership is unknown (unlabeled instances). Traditional two-class classification methods do not effectively handle such data.
Results: Here, we develop a novel method, likely positive-iterative classification (LP-IC), for this problem, and contrast its performance with the few existing methods, most of which were devised and utilized in the text classification context. LP-IC employs an iterative classification scheme and introduces a class dispersion measure, adopted from unsupervised clustering approaches, to monitor the model selection process. Using two case studies (prediction of HLA binding, and alternative splicing conservation between human and mouse), we show that LP-IC provides superior performance to existing methodologies in terms of: (i) combined accuracy and precision in positive identification from the unlabeled set; and (ii) predictive performance of the resultant classifiers on independent test data.
Contact: mark@biostat.ucsf.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
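Illustrative sketch (not from the paper): the abstract gives no algorithmic detail, so the Python below is only a generic iterative positive-unlabeled scheme and should not be read as the authors' LP-IC. It assumes a nearest-centroid classifier on numeric feature vectors and uses the mean within-class squared distance as a stand-in for the class dispersion measure; the function names `iterative_pu` and `dispersion` are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion(pos, neg):
    """Mean within-class squared distance to the class centroid.

    ASSUMPTION: a simple clustering-style criterion standing in for the
    paper's class dispersion measure, which the abstract does not define.
    """
    total = 0.0
    for group in (pos, neg):
        if group:
            c = centroid(group)
            total += sum(sq_dist(p, c) for p in group)
    return total / (len(pos) + len(neg))


def iterative_pu(positives, unlabeled, max_iter=20):
    """Iteratively split `unlabeled` into likely positives and likely negatives.

    Starts by treating every unlabeled instance as negative, refits the
    nearest-centroid rule each round, and returns the partition from the
    round with the lowest dispersion, plus the per-round dispersion values.
    """
    likely_pos, likely_neg = [], list(unlabeled)
    best = None
    history = []
    for _ in range(max_iter):
        if not likely_neg:
            break  # nothing left to serve as the negative class
        c_pos = centroid(positives + likely_pos)
        c_neg = centroid(likely_neg)
        new_pos, new_neg = [], []
        for u in unlabeled:
            if sq_dist(u, c_pos) < sq_dist(u, c_neg):
                new_pos.append(u)
            else:
                new_neg.append(u)
        score = dispersion(positives + new_pos, new_neg)
        history.append(score)
        if best is None or score < best[0]:
            best = (score, new_pos, new_neg)
        if new_pos == likely_pos:
            break  # assignments stopped changing
        likely_pos, likely_neg = new_pos, new_neg
    if best is None:
        return likely_pos, likely_neg, history
    return best[1], best[2], history


if __name__ == "__main__":
    # Toy 2-D data: known positives near (5, 5); unlabeled mixes both groups.
    positives = [[5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
    unlabeled = [[5.1, 5.0], [4.8, 5.2], [0.0, 0.1],
                 [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
    pos, neg, history = iterative_pu(positives, unlabeled)
    print("likely positives:", pos)
    print("likely negatives:", neg)
```

Monitoring a dispersion value per round, rather than classification accuracy, reflects the fact that no true negative labels are available to validate against; a real sequence application would also need a feature encoding and a stronger base classifier than the one assumed here.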
Pages: 1198-1205
Page count: 8