On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining

Times Cited: 0
Authors
Meng, Yu [1 ]
Huang, Jiaxin [1 ]
Zhang, Yu [1 ]
Han, Jiawei [1 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
Text Embedding; Language Models; Topic Discovery; Text Mining;
DOI
10.1145/3447548.3470810
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding approaches represent words as fixed low-dimensional vectors to capture their semantics, and the learned embeddings are used as input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications because they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first give an overview of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the foundation for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analyses.
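The abstract's point that PLMs only need to be fine-tuned on a target corpus rather than trained from scratch can be made concrete with a short sketch. The snippet below is illustrative only and is not part of the tutorial material: it fine-tunes a generic pre-trained encoder for binary text classification with the Hugging Face transformers library, where the checkpoint name "bert-base-uncased", the toy texts, the labels, and the hyperparameters are placeholder assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative placeholder checkpoint; any pre-trained encoder works similarly.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled examples standing in for the target-domain corpus.
texts = ["the movie was great", "the service was terrible"]
labels = torch.tensor([1, 0])

# Tokenize the batch and set up a standard fine-tuning optimizer.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps, not a full training run
    outputs = model(**batch, labels=labels)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Because all Transformer layers are initialized from the pre-trained checkpoint, only the small classification head is learned from scratch, which is why a modest amount of labeled target-domain data is typically sufficient.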
Pages: 4052 - 4053
Number of Pages: 2