MatchXML: An Efficient Text-Label Matching Framework for Extreme Multi-Label Text Classification

Cited by: 0
Authors
Ye, Hui [1]
Sunderraman, Rajshekhar [1]
Ji, Shihao [1]
Affiliations
[1] Georgia State Univ, Dept Comp Sci, Atlanta, GA 30302 USA
Funding
National Science Foundation (USA);
Keywords
Training; Transformers; Task analysis; Vectors; Self-supervised learning; Text categorization; Semantics; Extreme multi-label classification; label2vec; text-label matching; bipartite graph; contrastive learning;
DOI
10.1109/TKDE.2024.3374750
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We therefore propose label2vec, which trains semantic dense label embeddings with the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. When fine-tuning the pre-trained Transformer encoder, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. In addition to these fine-tuned dense text embeddings, we extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results demonstrate that MatchXML achieves state-of-the-art accuracy on five out of six datasets. In terms of training speed, MatchXML outperforms the competing methods on all six datasets.
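The label2vec idea in the abstract can be illustrated with a minimal sketch of how Skip-gram training pairs might be formed from label co-occurrence. It assumes that the label set of each training sample plays the role of a "sentence" and that, because labels are unordered, every other label in the same sample counts as context. The function name `skipgram_pairs` and the toy label sets are hypothetical and are not taken from the paper or its code.

```python
from itertools import permutations


def skipgram_pairs(label_sets):
    """Return (center, context) label pairs for Skip-gram training.

    Assumption: each sample's label set is treated as one unordered
    "sentence", so every ordered pair of distinct labels in the same
    sample becomes a training pair.
    """
    pairs = []
    for labels in label_sets:
        unique = sorted(set(labels))  # drop repeated labels within a sample
        pairs.extend(permutations(unique, 2))
    return pairs


# Toy label sets (hypothetical), one list per training sample.
samples = [
    ["cats", "pets"],
    ["cats", "pets", "veterinary"],
    ["python"],  # a single-label sample yields no pairs
]
pairs = skipgram_pairs(samples)
```

Pairs of this form could then be fed to any Skip-gram trainer to obtain dense label embeddings, which the abstract says are clustered to build the Hierarchical Label Tree.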
Pages: 4781 - 4793
Number of pages: 13
Related papers
50 in total
  • [21] Multi-label arabic text classification: an overview
    Aljedani, Nawal
    Alotaibi, Reem
    Taileb, Mounira
    [J]. International Journal of Advanced Computer Science and Applications, 2020, 11 (10) : 694 - 706
  • [22] Multi-Label Arabic Text Classification: An Overview
    Aljedani, Nawal
    Alotaibi, Reem
    Taileb, Mounira
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (10) : 694 - 706
  • [23] Integrating Label Semantic Similarity Scores into Multi-label Text Classification
    Chen, Zihao
    Liu, Yang
    Cheng, Baitai
    Peng, Jing
    [J]. ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT II, 2022, 13530 : 234 - 245
  • [24] Multi-Label Text Classification Based on Label Combination and Fusion of Attentions
    Wu, Xinke
    Sun, Jun
    Li, Zhihua
    [J]. Computer Engineering and Applications, 2023, 59 (06) : 125 - 133
  • [25] Multi-label Text Classification Method Based on Label Semantic Information
    Xiao, Lin
    Chen, Bo-Li
    Huang, Xin
    Liu, Hua-Feng
    Jing, Li-Ping
    Yu, Jian
    [J]. Ruan Jian Xue Bao/Journal of Software, 2020, 31 (04) : 1079 - 1089
  • [26] Label-Specific Document Representation for Multi-Label Text Classification
    Xiao, Lin
    Huang, Xin
    Chen, Boli
    Jing, Liping
    [J]. 2019 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING AND THE 9TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (EMNLP-IJCNLP 2019): PROCEEDINGS OF THE CONFERENCE, 2019, : 466 - 475
  • [27] Multi-label text classification based on the label correlation mixture model
    He, Zhiyang
    Wu, Ji
    Lv, Ping
    [J]. INTELLIGENT DATA ANALYSIS, 2017, 21 (06) : 1371 - 1392
  • [28] Variational Continuous Label Distribution Learning for Multi-Label Text Classification
    Zhao, Xingyu
    An, Yuexuan
    Xu, Ning
    Geng, Xin
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2024, 36 (06) : 2716 - 2729
  • [29] ADAM: An Attentional Data Augmentation Method for Extreme Multi-label Text Classification
    Zhang, Jiaxin
    Liu, Jie
    Chen, Shaowei
    Lin, Shaoxin
    Wang, Bingquan
    Wang, Shanpeng
    [J]. ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PAKDD 2022, PT I, 2022, 13280 : 131 - 142
  • [30] XRR: Extreme multi-label text classification with candidate retrieving and deep ranking
    Xiong, Jie
    Yu, Li
    Niu, Xi
    Leng, Youfang
    [J]. INFORMATION SCIENCES, 2023, 622 : 115 - 132