On the Power of Pre-Trained Text Representations: Models and Applications in Text Mining

Times Cited: 0
Authors
Meng, Yu [1 ]
Huang, Jiaxin [1 ]
Zhang, Yu [1 ]
Han, Jiawei [1 ]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Champaign, IL 61820 USA
Funding
U.S. National Science Foundation (NSF)
Keywords
Text Embedding; Language Models; Topic Discovery; Text Mining;
DOI
10.1145/3447548.3470810
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding approaches represent words as fixed low-dimensional vectors to capture their semantics, and the learned embeddings are used as input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications because they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first give an overview of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the foundation for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analyses.
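The abstract's point that PLMs only need to be fine-tuned on a target corpus rather than trained from scratch can be made concrete with a short sketch. The snippet below is illustrative only and is not part of the tutorial material: it fine-tunes a generic pre-trained encoder for binary text classification with the Hugging Face transformers library, where the checkpoint name "bert-base-uncased", the toy texts, the labels, and the hyperparameters are placeholder assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative placeholder checkpoint; any pre-trained encoder works similarly.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy labeled examples standing in for the target-domain corpus.
texts = ["the movie was great", "the service was terrible"]
labels = torch.tensor([1, 0])

# Tokenize the batch and set up a standard fine-tuning optimizer.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few gradient steps, not a full training run
    outputs = model(**batch, labels=labels)  # loss computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

Because all Transformer layers are initialized from the pre-trained checkpoint, only the small classification head is learned from scratch, which is why a modest amount of labeled target-domain data is typically sufficient.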
Pages: 4052 - 4053
Number of Pages: 2