Emerging trends: Subwords, seriously?

Author
Church, Kenneth Ward [1]
Affiliation
[1] Baidu, Sunnyvale, CA 94089 USA
Keywords
Subwords; Word pieces; Tokenization; Morphology; Etymology;
DOI
10.1017/S1351324920000145
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, "electroneutral" can be parsed as electron-eu-tral or electro-neutral, and "bidirectional" can be parsed as bid-ire-ction-al or bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, we consider several criteria, including sound and meaning. The prefix bi- has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
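The proposal to minimize the number of word pieces can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the paper's implementation: the function name `min_piece_parse` and the vocabulary `TOY_VOCAB` are hypothetical, the vocabulary contains just the pieces named in the abstract, and the continuation markers used by real tokenizers are omitted.

```python
from typing import List, Optional, Set


def min_piece_parse(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `word` into the fewest pieces drawn from `vocab`.

    Returns None when no sequence of vocabulary pieces covers the word.
    """
    n = len(word)
    # best[i] is a fewest-piece parse of word[:i], or None if unreachable.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prev = best[start]
            piece = word[start:end]
            if prev is None or piece not in vocab:
                continue
            current = best[end]
            # Keep the candidate only if it uses strictly fewer pieces.
            if current is None or len(prev) + 1 < len(current):
                best[end] = prev + [piece]
    return best[n]


# Hypothetical toy vocabulary: only the pieces mentioned in the abstract.
TOY_VOCAB = {
    "electron", "eu", "tral", "electro", "neutral",
    "bid", "ire", "ction", "al", "bi", "directional",
}

print(min_piece_parse("electroneutral", TOY_VOCAB))
print(min_piece_parse("bidirectional", TOY_VOCAB))
```

With this toy vocabulary both ambiguous words resolve to the two-piece parse (electro-neutral and bi-directional) rather than the three- and four-piece alternatives. A real dictionary would contain many more pieces, so ties between equally short parses would need an additional criterion.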
Pages: 375-382
Page count: 8