Automatic label curation from large-scale text corpus

Cited by: 0

Authors:

Avasthi, Sandhya [1]

Chauhan, Ritu [2]

Affiliations:

[1] ABES Engn Coll, Dept CSE, Ghaziabad, India

[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India

Source:

ENGINEERING RESEARCH EXPRESS | 2024 / Vol. 6 / Issue 01

Keywords:

automatic labeling; contextual word embedding; latent Dirichlet allocation; topic modeling; topic coherence; topic label

DOI:

10.1088/2631-8695/ad299e

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

Topic modeling extracts themes from large-scale text collections based on probabilistic measures. Although topic modeling surfaces the most important phrases describing the latent themes in a text collection, it does not by itself assign a suitable label to each topic. Interpreting the topics extracted from a text corpus and identifying a suitable label automatically reduces the cognitive load on the analyst. Extractive methods are typically used to select a label from a given candidate set, based on probability metrics for each candidate. Some existing approaches use phrases, words, and images to generate labels from frequency counts of words in the text. This paper proposes a method that automatically generates a label for each topic by applying a labeling strategy to filter candidate labels and then applying sequence-to-sequence labelers. The objective is to obtain a meaningful label for the output of the Latent Dirichlet Allocation (LDA) algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method. The proposed method automatically generates good, interpretable labels for topic words or terms compared with baseline models. A comparison with labels generated through the ChatGPT API shows the quality of the generated labels, with experiments performed on four datasets: NIPS, Kindle, PubMed, and CORD-19.
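The extractive step mentioned in the abstract, selecting a label from a candidate set using probability metrics, can be illustrated with a minimal sketch. This is not the authors' method: the scoring rule (average topic probability of the words in a candidate), the function names `score_candidate` and `pick_label`, and the topic and candidate data are all assumptions made for illustration only.

```python
from typing import Dict, List


def score_candidate(label: str, topic_word_probs: Dict[str, float]) -> float:
    """Average topic probability of the words in a candidate label.

    Words that do not appear among the topic's words contribute 0.
    This scoring rule is an illustrative assumption, not the paper's metric.
    """
    words = label.lower().split()
    if not words:
        return 0.0
    return sum(topic_word_probs.get(w, 0.0) for w in words) / len(words)


def pick_label(candidates: List[str], topic_word_probs: Dict[str, float]) -> str:
    """Return the highest-scoring candidate (the first one wins ties)."""
    return max(candidates, key=lambda c: score_candidate(c, topic_word_probs))


# Hypothetical topic: top words with probabilities, as an LDA model might output.
topic = {
    "neural": 0.08,
    "network": 0.07,
    "learning": 0.05,
    "training": 0.04,
    "layer": 0.03,
}

# Hypothetical candidate labels for that topic.
candidates = ["neural network", "machine learning", "protein structure"]

best = pick_label(candidates, topic)
print(best)
```

A frequency- or probability-based selector like this can only return a label that already exists in the candidate set, which is the limitation that motivates generating labels instead of only selecting them.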


Pages: 14
