Improving taxonomic classification with feature space balancing

Authors
Fuhl, Wolfgang [1]
Zabel, Susanne [1]
Nieselt, Kay [1]
Affiliations
[1] Univ Tubingen, Inst Biomed Informat IBMI, Sand 14, D-72076 Tubingen, Baden-Wurttemberg, Germany
Source
BIOINFORMATICS ADVANCES | 2023 / Volume 3 / Issue 01
Keywords
METAGENOMICS;
DOI
10.1093/bioadv/vbad092
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
Modern high-throughput sequencing technologies, such as metagenomic sequencing, generate millions of sequences that need to be assigned to their taxonomic rank. Modern approaches either apply local alignment against existing databases, as in MMseqs2, or use deep neural networks, as in DeepMicrobes and BERTax. Due to the increasing size of datasets and databases, alignment-based approaches are expensive in terms of runtime. Deep learning-based approaches can require specialized hardware and consume large amounts of energy. In this article, we propose to use k-mer profiles of DNA sequences as features for taxonomic classification. Although k-mer profiles have been used before, we were able to significantly increase their predictive power by applying a feature space balancing approach to the training data. This greatly improved the generalization quality of the classifiers. We have implemented different pipelines using our proposed feature extraction and dataset balancing in combination with different simple classifiers, such as bagged decision trees or feature subspace KNNs. By comparing the performance of our pipelines with state-of-the-art algorithms, such as BERTax and MMseqs2, on two different datasets, we show that our pipelines outperform these in almost all classification tasks. In particular, sequences from organisms that were not part of the training data were classified with high precision.
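As an illustration of the k-mer profile features named in the abstract, the following is a minimal sketch of one common way to compute them. It is not the authors' implementation: the function name `kmer_profile`, the default `k = 3`, the normalization to relative frequencies, and the skipping of windows with ambiguous bases are all assumptions made for this example. The feature space balancing step is not shown, because the abstract does not describe how it works.

```python
from itertools import product


def kmer_profile(sequence: str, k: int = 3) -> list[float]:
    """Return relative k-mer frequencies of a DNA sequence.

    Hypothetical helper for illustration only. The output has 4**k
    entries, ordered lexicographically over the alphabet A, C, G, T.
    """
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the feature vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)

    seq = sequence.upper()
    # Slide a window of length k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        pos = index.get(seq[start:start + k])
        if pos is not None:  # assumption: skip windows with ambiguous bases such as N
            counts[pos] += 1

    total = sum(counts)
    # Normalize so that sequences of different lengths are comparable.
    return [c / total for c in counts] if total else [0.0] * len(counts)


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGT", k=2)
    print(len(profile), round(sum(profile), 6))
```

A vector of this kind, one per sequence, could then serve as the input to a simple classifier such as the bagged decision trees or KNN variants mentioned above.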
Pages: 7