Automatic label curation from large-scale text corpus

Cited by: 0
Authors
Avasthi, Sandhya [1]
Chauhan, Ritu [2]
Affiliations
[1] ABES Engn Coll, Dept CSE, Ghaziabad, India
[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India
Source
ENGINEERING RESEARCH EXPRESS | 2024 / Vol. 6 / Issue 01
Keywords
automatic labeling; contextual word embedding; latent Dirichlet allocation; topic modeling; topic coherence; topic label
DOI
10.1088/2631-8695/ad299e
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Topic modeling extracts themes from a large-scale text collection based on probabilistic measurements. Although it surfaces the most important words and phrases describing the latent themes in a collection, it does not itself provide a suitable label for each topic. Interpreting the extracted topics and identifying a suitable label automatically reduces the cognitive load on the analyst. Extractive methods typically select a label from a given candidate set, based on probability metrics computed for each candidate. Some existing approaches generate labels from phrases, words, or images, using frequency counts of words in the text. This paper proposes a method that automatically generates a label for each topic: a labeling strategy first filters the candidate labels, and sequence-to-sequence labelers are then applied. The objective is to obtain a meaningful label for each topic produced by the Latent Dirichlet Allocation (LDA) algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method. The proposed method automatically generates labels for topic words or terms that are more interpretable than those of the baseline models. Experiments on four datasets (NIPS, Kindle, PUBMED, and CORD-19) compare the generated labels with labels produced through the ChatGPT API to show their quality.
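The extractive step mentioned in the abstract, selecting a label from a candidate set using probability metrics, can be illustrated with a minimal sketch. This is not the paper's method: the function name `rank_candidate_labels`, the scoring rule (mean topic probability of a candidate's words), and the topic words and probabilities are all hypothetical, chosen only to show the general idea.

```python
from typing import Dict, List, Tuple


def rank_candidate_labels(
    topic_word_probs: Dict[str, float],
    candidates: List[str],
) -> List[Tuple[str, float]]:
    """Rank candidate labels by the mean topic probability of their words.

    Hypothetical scoring rule for illustration; not taken from the paper.
    """
    ranked = []
    for label in candidates:
        tokens = label.lower().split()
        if not tokens:
            continue  # skip empty candidates
        # Words that do not appear in the topic contribute zero probability.
        score = sum(topic_word_probs.get(tok, 0.0) for tok in tokens) / len(tokens)
        ranked.append((label, score))
    # Highest score first; ties are broken alphabetically for determinism.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Made-up top words of one topic with made-up probabilities (assumption).
topic = {
    "virus": 0.30,
    "infection": 0.25,
    "vaccine": 0.20,
    "patient": 0.15,
    "immune": 0.10,
}
candidates = ["vaccine infection", "patient care", "virus infection", "stock market"]

best_label, best_score = rank_candidate_labels(topic, candidates)[0]
print(best_label, round(best_score, 3))
```

In this toy setup, a candidate made entirely of high-probability topic words outranks candidates that share few or no words with the topic. A generative approach such as the sequence-to-sequence labelers named in the abstract would go further and produce labels that are not restricted to the candidate set.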
Pages: 14