On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining

Cited by: 0
Authors
Meng, Yu [1 ]
Huang, Jiaxin [1 ]
Zhang, Yu [1 ]
Han, Jiawei [1 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA
Funding
National Science Foundation (USA);
Keywords
Text Embedding; Language Models; Topic Discovery; Text Mining;
DOI
10.1145/3447548.3470810
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate and efficient text analyses.
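As a concrete illustration of the fine-tune-instead-of-train-from-scratch workflow the abstract describes, the minimal sketch below adapts a pre-trained encoder to a text classification corpus. It is not taken from the tutorial itself; the Hugging Face transformers and datasets libraries, the bert-base-uncased checkpoint, and the ag_news corpus are illustrative assumptions.

```python
# Minimal sketch (assumptions: Hugging Face transformers/datasets, bert-base-uncased,
# ag_news): fine-tune a pre-trained language model for text classification rather
# than training a task-specific model from scratch.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"   # any pre-trained encoder checkpoint would do
dataset = load_dataset("ag_news")  # illustrative labeled target corpus (4 classes)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # Convert raw text into the subword IDs the pre-trained model expects.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

# A fresh classification head is placed on top of the pre-trained encoder, and the
# whole network is fine-tuned on a small amount of labeled target data.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="plm-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=encoded["test"].select(range(500)),
)
trainer.train()
```

The same encoder checkpoint can be reused for other downstream tasks (e.g., topic discovery or weakly-supervised classification) by swapping the head and the target corpus, which is the transferability property the abstract emphasizes.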
Pages: 4052-4053
Page count: 2