On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining

Cited by: 0
Authors
Meng, Yu [1 ]
Huang, Jiaxin [1 ]
Zhang, Yu [1 ]
Han, Jiawei [1 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA
Funding
National Science Foundation (USA);
Keywords
Text Embedding; Language Models; Topic Discovery; Text Mining;
DOI
10.1145/3447548.3470810
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Code
081104; 0812; 0835; 1405;
Abstract
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate and efficient text analyses.
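As a concrete illustration of the fine-tune-instead-of-train-from-scratch workflow the abstract describes, the minimal sketch below adapts a pre-trained encoder to a text classification corpus. It is not taken from the tutorial itself; the Hugging Face transformers and datasets libraries, the bert-base-uncased checkpoint, and the ag_news corpus are illustrative assumptions.

```python
# Minimal sketch (assumptions: Hugging Face transformers/datasets, bert-base-uncased,
# ag_news): fine-tune a pre-trained language model for text classification rather
# than training a task-specific model from scratch.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"   # any pre-trained encoder checkpoint would do
dataset = load_dataset("ag_news")  # illustrative labeled target corpus (4 classes)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # Convert raw text into the subword IDs the pre-trained model expects.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

# A fresh classification head is placed on top of the pre-trained encoder, and the
# whole network is fine-tuned on a small amount of labeled target data.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="plm-finetune",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=encoded["test"].select(range(500)),
)
trainer.train()
```

The same encoder checkpoint can be reused for other downstream tasks (e.g., topic discovery or weakly-supervised classification) by swapping the head and the target corpus, which is the transferability property the abstract emphasizes.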
Pages: 4052-4053
Page count: 2