Emerging trends: Subwords, seriously?

Cited by: 5
Author
Church, Kenneth Ward [1]
Affiliation
[1] Baidu, Sunnyvale, CA 94089 USA
Keywords
Subwords; Word pieces; Tokenization; Morphology; Etymology;
DOI
10.1017/S1351324920000145
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Subwords have become very popular, but the BERT and ERNIE tokenizers often produce surprising results. Byte pair encoding (BPE) trains a dictionary with a simple information theoretic criterion that sidesteps the need for special treatment of unknown words. BPE is more about training (populating a dictionary of word pieces) than inference (parsing an unknown word into word pieces). The parse at inference time can be ambiguous. Which parse should we use? For example, "electroneutral" can be parsed as electron-eu-tral or electro-neutral, and "bidirectional" can be parsed as bid-ire-ction-al and bi-directional. BERT and ERNIE tend to favor the parse with more word pieces. We propose minimizing the number of word pieces. To justify our proposal, a number of criteria will be considered: sound, meaning, etc. The prefix, bi-, has the desired vowel (unlike bid) and the desired meaning (bi is Latin for two, unlike bid, which is Germanic for offer).
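The proposal to minimize the number of word pieces can be illustrated with a short dynamic program. What follows is a minimal sketch, not the author's implementation: the function name `min_pieces` and the toy vocabulary `TOY_VOCAB` are hypothetical, the vocabulary holds only the pieces named in the abstract, and the `##` continuation markers used by real BERT vocabularies are left out for simplicity.

```python
from typing import List, Optional, Set


def min_pieces(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Return a parse of `word` into the fewest pieces from `vocab`.

    Returns None when no parse exists. Ties are broken in favor of
    the first fewest-piece parse found.
    """
    n = len(word)
    # best[i] is a fewest-piece parse of word[:i], or None if unreachable.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prev = best[start]
            piece = word[start:end]
            if prev is None or piece not in vocab:
                continue
            current = best[end]
            if current is None or len(prev) + 1 < len(current):
                best[end] = prev + [piece]
    return best[n]


# Hypothetical toy vocabulary: just the pieces mentioned in the abstract.
TOY_VOCAB = {
    "bid", "ire", "ction", "al", "bi", "directional",
    "electron", "eu", "tral", "electro", "neutral",
}

print(min_pieces("bidirectional", TOY_VOCAB))   # ['bi', 'directional']
print(min_pieces("electroneutral", TOY_VOCAB))  # ['electro', 'neutral']
```

With this toy vocabulary both four-piece and three-piece parses are available, and the fewest-piece criterion selects bi-directional and electro-neutral.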
Pages: 375-382
Number of pages: 8