Automatic label curation from large-scale text corpus

Cited by: 0
Authors
Avasthi, Sandhya [1]
Chauhan, Ritu [2]
Affiliations
[1] ABES Engn Coll, Dept CSE, Ghaziabad, India
[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India
Source
ENGINEERING RESEARCH EXPRESS | 2024 / Vol. 6 / Issue 01
Keywords
automatic labeling; contextual word embedding; latent Dirichlet allocation; topic modeling; topic coherence; topic label
DOI
10.1088/2631-8695/ad299e
Chinese Library Classification (CLC) number
T [Industrial Technology]
Discipline classification code
08
Abstract
Topic modeling extracts themes from a large-scale text collection based on probabilistic measurements. Although topic modeling surfaces the most important words and phrases describing the latent themes in a collection, it does not assign each theme a suitable label. Automatically interpreting the topics extracted from a corpus and identifying a suitable label reduces the cognitive load on the analyst. Extractive methods are typically used to select a label from a given candidate set, based on probability metrics for each candidate. Some existing approaches use phrases, words, and images to generate labels from the frequency counts of words in the text. This paper proposes a method that automatically generates a label for each topic by applying a labeling strategy to filter candidate labels and then applying sequence-to-sequence labelers. The objective is to obtain a meaningful label for each topic produced by the Latent Dirichlet Allocation (LDA) algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method. The proposed method automatically generates good, interpretable labels for topic words or terms compared with baseline models. Experiments on four datasets (NIPS, Kindle, PUBMED, and CORD-19) include a comparison with labels generated through the ChatGPT API to show the quality of the generated labels.
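The extractive selection step mentioned in the abstract (choosing a label from a candidate set using probability metrics) can be illustrated with a minimal sketch. This is not the paper's method: the scoring rule (mean topic probability of a candidate's words), the function names, the topic words, and the probabilities are all assumptions made up for illustration, standing in for what an LDA model might output.

```python
from typing import Dict, List, Tuple


def score_candidate(label: str, topic: Dict[str, float]) -> float:
    """Mean topic probability of the label's words (0.0 for words not in the topic)."""
    words = label.lower().split()
    if not words:
        return 0.0
    return sum(topic.get(w, 0.0) for w in words) / len(words)


def rank_labels(candidates: List[str], topic: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (candidate, score) pairs sorted by descending score."""
    scored = [(c, score_candidate(c, topic)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical topic: top words with invented probabilities,
# in the shape a fitted LDA model might report them.
topic = {
    "virus": 0.08,
    "infection": 0.06,
    "patient": 0.05,
    "vaccine": 0.04,
    "respiratory": 0.03,
}

# Hypothetical candidate labels to choose from.
candidates = ["respiratory virus infection", "vaccine trial", "machine learning"]

ranked = rank_labels(candidates, topic)
best_label = ranked[0][0]
print(best_label)
```

Averaging over the candidate's words keeps longer candidates from winning on length alone; a generative approach such as the sequence-to-sequence labelers described in the abstract would replace this selection step rather than extend it.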
Pages: 14
Related papers
50 in total
  • [1] The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization
    Hu, Yan
    Wu, Wei
    Miao, Miao
    IEEC 2009: FIRST INTERNATIONAL SYMPOSIUM ON INFORMATION ENGINEERING AND ELECTRONIC COMMERCE, PROCEEDINGS, 2009, : 640 - 645
  • [2] Temporal knowledge extraction from large-scale text corpus
    Yu Liu
    Wen Hua
    Xiaofang Zhou
    World Wide Web, 2021, 24 : 135 - 156
  • [3] Temporal knowledge extraction from large-scale text corpus
    Liu, Yu
    Hua, Wen
    Zhou, Xiaofang
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2021, 24 (01): : 135 - 156
  • [4] Extracting Temporal Patterns from Large-Scale Text Corpus
    Liu, Yu
    Hua, Wen
    Zhou, Xiaofang
    DATABASES THEORY AND APPLICATIONS (ADC 2019), 2019, 11393 : 17 - 30
  • [5] Automatic Acquisition of Large-scale Academic Bilingual Parallel Corpus from the Web
    Han Yong
    Li Yu
    He Xiaoning
    Yang Muyun
    Lei Guohua
    2009 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2009, : 318 - 321
  • [6] Automatic Speech Recognition of Vietnamese for a New Large-Scale Corpus
    Tran, Linh Thi Thuc
    Kim, Han-Gyu
    La, Hoang Minh
    Pham, Su Van
    ELECTRONICS, 2024, 13 (05)
  • [7] A LARGE-SCALE CHINESE LONG-TEXT EXTRACTIVE SUMMARIZATION CORPUS
    Chen, Kai
    Fu, Guanyu
    Chen, Qingcai
    Hu, Baotian
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 7828 - 7832
  • [8] eSCAPE: a Large-scale Synthetic Corpus for Automatic Post-Editing
    Negri, Matteo
    Turchi, Marco
    Chatterjee, Rajen
    Bertoldi, Nicola
    PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 24 - 30
  • [9] Large-Scale Multi-Label Text Classification on EU Legislation
    Chalkidis, Ilias
    Fergadiotis, Manos
    Malakasiotis, Prodromos
    Androutsopoulos, Ion
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 6314 - 6322
  • [10] Computational Curation and the Application of Large-Scale Vocabularies
    Grabus, Sam
    Greenberg, Jane
    2021 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2021, : 2220 - 2223