MatchXML: An Efficient Text-Label Matching Framework for Extreme Multi-Label Text Classification

Authors
Ye, Hui [1]
Sunderraman, Rajshekhar [1]
Ji, Shihao [1]
Affiliations
[1] Georgia State Univ, Dept Comp Sci, Atlanta, GA 30302 USA
Funding
U.S. National Science Foundation;
Keywords
Training; Transformers; Task analysis; Vectors; Self-supervised learning; Text categorization; Semantics; Extreme multi-label classification; label2vec; text-label matching; bipartite graph; contrastive learning;
DOI
10.1109/TKDE.2024.3374750
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
eXtreme Multi-label text Classification (XMC) refers to training a classifier that assigns relevant labels to a text sample from an extremely large label set (e.g., millions of labels). We propose MatchXML, an efficient text-label matching framework for XMC. We observe that label embeddings generated from sparse Term Frequency-Inverse Document Frequency (TF-IDF) features have several limitations. We therefore propose label2vec, which trains semantic dense label embeddings with the Skip-gram model. The dense label embeddings are then used to build a Hierarchical Label Tree by clustering. When fine-tuning the pre-trained Transformer encoder, we formulate multi-label text classification as a text-label matching problem in a bipartite graph. We then extract dense text representations from the fine-tuned Transformer. In addition to these fine-tuned dense text embeddings, we extract static dense sentence embeddings from a pre-trained Sentence Transformer. Finally, a linear ranker is trained on the sparse TF-IDF features, the fine-tuned dense text representations, and the static dense sentence features. Experimental results show that MatchXML achieves state-of-the-art accuracy on five out of six datasets. In training speed, MatchXML outperforms the competing methods on all six datasets.
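As a rough illustration of the label2vec idea mentioned in the abstract, the sketch below builds Skip-gram style (target, context) training pairs from label sets. It assumes that the label set of each training sample plays the role of a "sentence" and that every other label in the same set counts as context; the abstract does not spell out this detail, so treat it as an assumption. The function name `skipgram_pairs` and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import permutations


def skipgram_pairs(label_sets):
    """Return (target, context) label pairs, one per ordered co-occurrence.

    Assumption (not stated in the abstract): all labels attached to the
    same text sample are treated as mutual context, like words that share
    a Skip-gram window.
    """
    pairs = []
    for labels in label_sets:
        unique = sorted(set(labels))  # drop duplicate labels within a sample
        pairs.extend(permutations(unique, 2))  # ordered pairs, no self-pairs
    return pairs


# Toy data: each inner list is the label set of one text sample.
label_sets = [
    ["sports", "football"],
    ["sports", "football", "uk"],
    ["politics", "uk"],
]

pairs = skipgram_pairs(label_sets)
cooccurrence = Counter(pairs)
```

Pairs like these could then be fed to any Skip-gram trainer to obtain dense label vectors; labels that frequently co-occur, such as "sports" and "football" here, would be expected to end up close together in the embedding space.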
Pages: 4781-4793
Page count: 13