Emerging trends: Subwords, seriously?

Cited by: 5
Authors
Church, Kenneth Ward [1]
Affiliations
[1] Baidu, Sunnyvale, CA 94089 USA
Keywords
Subwords; Word pieces; Tokenization; Morphology; Etymology
DOI
10.1017/S1351324920000145
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, "electroneutral" can be parsed as electron-eu-tral or electro-neutral, and "bidirectional" can be parsed as bid-ire-ction-al or bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. The prefix, bi-, has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
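The ambiguity described in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the toy vocabularies, the function names `greedy_longest_match` and `fewest_pieces`, and the omission of continuation markers such as `##` are all assumptions made for illustration. The first function applies a greedy longest-match-first strategy of the kind used by WordPiece-style tokenizers; the second uses dynamic programming to return a parse with the fewest word pieces, as the abstract proposes.

```python
from typing import List, Optional, Set


def greedy_longest_match(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Repeatedly take the longest vocabulary entry that prefixes the rest of the word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return None  # no vocabulary entry matches at this position
        pieces.append(word[start:end])
        start = end
    return pieces


def fewest_pieces(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Return a segmentation of `word` that uses the fewest vocabulary entries."""
    n = len(word)
    # best[i] holds a fewest-piece segmentation of word[:i], or None if none exists.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            if prefix is not None and word[start:end] in vocab:
                candidate = prefix + [word[start:end]]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


# Hypothetical toy vocabularies, chosen only to reproduce the abstract's examples.
vocab_a = {"electron", "eu", "tral", "electro", "neutral"}
vocab_b = {"bid", "ire", "ction", "al", "bi", "directional"}

print(greedy_longest_match("electroneutral", vocab_a))  # ['electron', 'eu', 'tral']
print(fewest_pieces("electroneutral", vocab_a))         # ['electro', 'neutral']
print(greedy_longest_match("bidirectional", vocab_b))   # ['bid', 'ire', 'ction', 'al']
print(fewest_pieces("bidirectional", vocab_b))          # ['bi', 'directional']
```

With these toy vocabularies, the greedy parse commits to the longest first piece (electron, bid) and is then forced into short leftover pieces, while the fewest-pieces parse recovers the morphologically sensible split. Real tokenizers may differ in details such as continuation markers and tie-breaking.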
Pages: 375-382
Number of pages: 8
Related papers
50 in total
  • [21] Emerging trends in education
    Monreal Guerrero, Ines Maria
    PIXEL-BIT - REVISTA DE MEDIOS Y EDUCACION, 2014, (44): 231 - 232
  • [22] Climate Services in Asia Pacific: Emerging Trends and Prospects
    Cheng, Chia-Ping
    Lin, Hen-I
    Wang, Simon
    Liu, Po-Ting Dean
    Chao, Kung-Yueh Camyale
    BULLETIN OF THE AMERICAN METEOROLOGICAL SOCIETY, 2020, 101 (09) : E1568 - E1571
  • [23] Emerging Powers and Emerging Trends in Global Governance
    Stephen, Matthew D.
    GLOBAL GOVERNANCE, 2017, 23 (03) : 483 - 502
  • [25] On words containing all short subwords
    Tomescu, I
    THEORETICAL COMPUTER SCIENCE, 1998, 197 (1-2) : 235 - 240
  • [26] Cluster Algebras and Binary Subwords
    Rachel Bailey
    Emily Gunawan
    Order, 2022, 39 : 55 - 69
  • [27] GEOMETRIC DISTRIBUTIONS AND FORBIDDEN SUBWORDS
    PRODINGER, H
    FIBONACCI QUARTERLY, 1995, 33 (02): 139 - 141
  • [28] Counting subwords in flattened permutations
    Mansour, Toufik
    Shattuck, Mark
    Wang, David G. L.
    JOURNAL OF COMBINATORICS, 2013, 4 (03) : 327 - 356
  • [29] Chinese Word Embeddings with Subwords
    Yang, Gang
    Xu, Hongzhe
    Li, Wen
    2018 INTERNATIONAL CONFERENCE ON ALGORITHMS, COMPUTING AND ARTIFICIAL INTELLIGENCE (ACAI 2018), 2018
  • [30] On the Minimum Density of Monotone Subwords
    Yuster, Raphael
    ELECTRONIC JOURNAL OF COMBINATORICS, 2025, 32 (01)