From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding

Cited by: 0
Authors
Sun, Li [1 ]
Luisier, Florian [2 ]
Batmanghelich, Kayhan [1 ]
Florencio, Dinei [2 ]
Zhang, Cha [2 ]
Affiliations
[1] Boston Univ, Boston, MA 02215 USA
[2] Microsoft, Redmond, WA USA
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Current state-of-the-art models for natural language understanding require a preprocessing step to convert raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. This fixed vocabulary limits the model's robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach: one at the word level and another at the sequence level. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus directly operates on character sequences with explicit awareness of word boundaries, but without a biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.
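To make the two-level design concrete, below is a minimal sketch in PyTorch of a shallow intra-word character encoder feeding a deep inter-word encoder, as the abstract describes. The class name, layer counts, dimensions, and mean-pooling of character states into word vectors are assumptions for illustration, not the authors' exact configuration.

```python
# Hypothetical sketch of the hierarchical character-to-word encoder; all
# hyperparameters and the pooling scheme are assumptions, not the paper's.
import torch
import torch.nn as nn

class HierarchicalCharWordEncoder(nn.Module):
    def __init__(self, num_chars=256, d_model=256,
                 intra_layers=2, inter_layers=12, nhead=8):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, d_model, padding_idx=0)
        intra_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        inter_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Shallow intra-word encoder: builds one vector per word from its characters.
        self.intra_word = nn.TransformerEncoder(intra_layer, num_layers=intra_layers)
        # Deep inter-word encoder: contextualizes word vectors over the whole sequence.
        self.inter_word = nn.TransformerEncoder(inter_layer, num_layers=inter_layers)

    def forward(self, char_ids):
        # char_ids: (batch, num_words, chars_per_word); id 0 is character padding.
        b, w, c = char_ids.shape
        chars = self.char_emb(char_ids).view(b * w, c, -1)
        pad_mask = (char_ids == 0).view(b * w, c)
        char_states = self.intra_word(chars, src_key_padding_mask=pad_mask)
        # Mean-pool non-padding character states into a word representation
        # (an assumption; other pooling choices are possible).
        keep = (~pad_mask).unsqueeze(-1).float()
        word_vecs = (char_states * keep).sum(1) / keep.sum(1).clamp(min=1.0)
        word_vecs = word_vecs.view(b, w, -1)
        return self.inter_word(word_vecs)

# Example: 2 sequences, 5 words each, up to 12 characters per word.
model = HierarchicalCharWordEncoder()
dummy = torch.randint(1, 256, (2, 5, 12))
print(model(dummy).shape)  # torch.Size([2, 5, 256])
```

Because the vocabulary is just the character set, such a model can represent misspelled or out-of-domain words directly, which is the robustness property the abstract reports.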
Pages: 3605 - 3620
Number of pages: 16
Related Papers
50 items in total
  • [41] Leveraging Pre-trained Language Model for Speech Sentiment Analysis
    Shon, Suwon
    Brusco, Pablo
    Pan, Jing
    Han, Kyu J.
    Watanabe, Shinji
    INTERSPEECH 2021, 2021, : 3420 - 3424
  • [42] Software Vulnerabilities Detection Based on a Pre-trained Language Model
    Xu, Wenlin
    Li, Tong
    Wang, Jinsong
    Duan, Haibo
    Tang, Yahui
    2023 IEEE 22ND INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM, BIGDATASE, CSE, EUC, ISCI 2023, 2024, : 904 - 911
  • [43] AraXLNet: pre-trained language model for sentiment analysis of Arabic
    Alduailej, Alhanouf
    Alothaim, Abdulrahman
    JOURNAL OF BIG DATA, 2022, 9 (01)
  • [44] A survey of text classification based on pre-trained language model
    Wu, Yujia
    Wan, Jun
    NEUROCOMPUTING, 2025, 616
  • [45] Integrating Pre-Trained Language Model With Physical Layer Communications
    Lee, Ju-Hyung
    Lee, Dong-Ho
    Lee, Joohan
    Pujara, Jay
    IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS, 2024, 23 (11) : 17266 - 17278
  • [46] SsciBERT: a pre-trained language model for social science texts
    Shen, Si
    Liu, Jiangfeng
    Lin, Litao
    Huang, Ying
    Zhang, Lin
    Liu, Chang
    Feng, Yutong
    Wang, Dongbo
    SCIENTOMETRICS, 2023, 128 (02) : 1241 - 1263
  • [47] Interpretability of Entity Matching Based on Pre-trained Language Model
    Liang Z.
    Wang H.-Z.
    Dai J.-J.
    Shao X.-Y.
    Ding X.-O.
    Mu T.-Y.
Ruan Jian Xue Bao/Journal of Software, 2023, 34 (03): 1087 - 1108
  • [48] TextPruner: A Model Pruning Toolkit for Pre-Trained Language Models
    Yang, Ziqing
    Cui, Yiming
    Chen, Zhigang
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022): PROCEEDINGS OF SYSTEM DEMONSTRATIONS, 2022, : 35 - 43
  • [49] Learning and Evaluating a Differentially Private Pre-trained Language Model
    Hoory, Shlomo
    Feder, Amir
    Tendler, Avichai
    Cohen, Alon
    Erell, Sofia
    Laish, Itay
    Nakhost, Hootan
    Stemmer, Uri
    Benjamini, Ayelet
    Hassidim, Avinatan
    Matias, Yossi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 1178 - 1189
  • [50] From Cloze to Comprehension: Retrofitting Pre-trained Masked Language Models to Pre-trained Machine Reader
    Xu, Weiwen
    Li, Xin
    Zhang, Wenxuan
    Zhou, Meng
    Lam, Wai
    Si, Luo
    Bing, Lidong
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,