Automatic label curation from large-scale text corpus

Cited by: 0
Authors
Avasthi, Sandhya [1]
Chauhan, Ritu [2]
Affiliations
[1] ABES Engn Coll, Dept CSE, Ghaziabad, India
[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India
Source
ENGINEERING RESEARCH EXPRESS | 2024 / Vol. 6 / Issue 01
Keywords
automatic labeling; contextual word embedding; latent dirichlet allocation; topic modeling; topic coherence; topic label
DOI
10.1088/2631-8695/ad299e
Chinese Library Classification number
T [Industrial Technology]
Discipline classification code
08
Abstract
Topic modeling extracts themes from any large-scale text collection based on probabilistic measurements. Although topic modeling pulls out the most important phrases that describe latent themes in a text collection, the resulting topics still lack a suitable label. Automatically interpreting the topics extracted from a text corpus and identifying a suitable label reduces the cognitive load on the analyst. Extractive methods typically select a label from a given candidate set, based on probability metrics for each candidate. Some existing approaches use phrases, words, and images to generate labels from frequency counts of words in the text. The paper proposes a method that automatically generates a label for each topic: a labeling strategy first filters candidate labels, and sequence-to-sequence labelers are then applied. The objective is to obtain a meaningful label for the output of the Latent Dirichlet Allocation algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method. The proposed method automatically generates good, interpretable labels for topic words or terms compared with the baseline models. A comparison with labels generated through the ChatGPT API shows the quality of the generated labels, with experiments performed on four datasets: NIPS, Kindle, PUBMED, and CORD-19.
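The extractive baseline mentioned in the abstract (selecting a label from a candidate set using probability metrics) can be illustrated with a small sketch. This is not the paper's method, which filters candidates and then applies sequence-to-sequence labelers. The scoring rule (mean topic-word probability over a label's tokens), the function names, the topic words, and all probabilities below are assumptions made up for illustration.

```python
from typing import Dict, List, Tuple


def score_candidate(label: str, topic_word_probs: Dict[str, float]) -> float:
    """Mean topic-word probability over the tokens of a candidate label.

    Tokens absent from the topic contribute 0, so labels padded with
    off-topic words are penalized. (Assumed scoring rule, for illustration.)
    """
    tokens = label.lower().split()
    if not tokens:
        return 0.0
    return sum(topic_word_probs.get(tok, 0.0) for tok in tokens) / len(tokens)


def rank_labels(
    candidates: List[str], topic_word_probs: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return candidates paired with their scores, best first."""
    scored = [(c, score_candidate(c, topic_word_probs)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical top words of one LDA topic; the probabilities are invented.
topic = {
    "virus": 0.09,
    "infection": 0.07,
    "vaccine": 0.06,
    "patient": 0.04,
    "immune": 0.03,
}
# Hypothetical candidate labels for that topic.
candidates = ["vaccine immune response", "virus infection", "patient care cost"]

ranked = rank_labels(candidates, topic)
best_label = ranked[0][0]
print(best_label)
```

In a realistic pipeline, the topic-word probabilities would come from a fitted LDA model and the candidates from phrases mined from the corpus; a generative labeler could then replace or refine the top-ranked candidate.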
Pages: 14