Automatic label curation from large-scale text corpus

Cited by: 0

Authors:

Avasthi, Sandhya [1]

Chauhan, Ritu [2]

Affiliations:

[1] ABES Engn Coll, Dept CSE, Ghaziabad, India

[2] Amity Univ, Ctr Computat Biol & Bioinformat, AI & IoT Lab, Noida, Uttar Pradesh, India

Source:

ENGINEERING RESEARCH EXPRESS | 2024 / Vol. 6 / Issue 01

Keywords:

automatic labeling; contextual word embedding; latent dirichlet allocation; topic modeling; topic coherence; topic label;

DOI:

10.1088/2631-8695/ad299e

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Discipline classification code:

08 ;

Abstract:

Topic modeling extracts themes from a large-scale text collection based on probabilistic measurements. Although topic modeling surfaces the most important words and phrases describing the latent themes in a collection, it does not assign a suitable label to each topic. Interpreting the extracted topics and identifying a suitable label automatically reduces the cognitive load on the analyst. Extractive methods typically select a label from a given candidate set based on probability metrics for each candidate. Some existing approaches generate labels from phrases, words, and images using frequency counts of the words in the text. This paper proposes a method that automatically generates a label for each topic by applying a labeling strategy to filter candidate labels and then applying sequence-to-sequence labelers. The objective is to obtain a meaningful label for each topic produced by the Latent Dirichlet Allocation (LDA) algorithm. The BERTScore metric is used to evaluate the effectiveness of the proposed method. The proposed method automatically generates more interpretable labels for topic words or terms than the baseline models. A comparison with labels generated through the ChatGPT API shows the quality of the generated labels in experiments performed on four datasets: NIPS, Kindle, PubMed, and CORD-19.
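As a rough illustration of the extractive baseline the abstract mentions (choosing a label from a candidate set using probability metrics), the sketch below ranks candidate labels by the average topic-word probability of their words. It is not the paper's method: the scoring rule, the function names `score_label` and `rank_labels`, and the topic distribution and candidate list are all hypothetical, chosen only to make the idea concrete.

```python
from typing import Dict, List, Tuple


def score_label(label: str, topic_word_probs: Dict[str, float]) -> float:
    """Average topic probability of the words in a candidate label.

    Words that do not appear in the topic contribute 0.
    This scoring rule is an assumption made for illustration.
    """
    words = label.lower().split()
    if not words:
        return 0.0
    return sum(topic_word_probs.get(w, 0.0) for w in words) / len(words)


def rank_labels(
    candidates: List[str], topic_word_probs: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs sorted from highest to lowest score."""
    scored = [(c, score_label(c, topic_word_probs)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical top words of one LDA topic with their probabilities.
topic = {
    "virus": 0.30,
    "infection": 0.25,
    "vaccine": 0.20,
    "patient": 0.15,
    "cell": 0.10,
}

# Hypothetical candidate labels for that topic.
candidates = ["virus infection", "patient care", "neural network"]

ranked = rank_labels(candidates, topic)
best_label = ranked[0][0]
print(best_label)
```

A frequency- or probability-based ranking like this can only return labels already present in the candidate set, which is the limitation that motivates generating labels with sequence-to-sequence models instead.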


Pages: 14
