Stemmer and phonotactic rules to improve n-gram tagger-based Indonesian phonemicization

Cited by: 0
Authors
Suyanto, Suyanto [1]
Sunyoto, Andi [2]
Ismail, Rezza Nafi [1]
Rachmawati, Ema [1]
Maharani, Warih [1]
Affiliations
[1] Telkom Univ, Sch Comp, Bandung, Indonesia
[2] Univ Amikom Yogyakarta, Fac Comp Sci, Yogyakarta, Indonesia
Keywords
grapheme-to-phoneme conversion; Indonesian language; n-gram; phonotactic rules; stemmer; MODEL
DOI
10.1016/j.jksuci.2021.01.006
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Phonemicization, or grapheme-to-phoneme (G2P) conversion, is the process of converting a word into its pronunciation. It is an essential component of speech synthesis, speech recognition, and natural language processing. State-of-the-art G2P models based on deep learning (DL) generally give a low phoneme error rate (PER) and word error rate (WER) for high-resource languages, such as English and other European languages, but not for low-resource languages. Conventional machine learning (ML)-based G2P models that incorporate specific linguistic knowledge are therefore preferable for low-resource languages. However, these models still perform poorly on several low-resource languages because of various issues. For instance, an Indonesian G2P model works well for roots but gives a high PER for derivatives. Most errors come from ambiguities in some roots and in derivative words containing four prefixes: <ber>, <meng>, <peng>, and <ter>. In this research, an Indonesian G2P model based on an n-gram tagger combined with a stemmer and phonotactic rules (NGTSP) is proposed to solve those problems. An evaluation based on 5-fold cross-validation, using 50k Indonesian words, shows that the proposed NGTSP gives a PER of 0.78%, much lower than that of the state-of-the-art Transformer-based G2P model (1.14%). It also provides a much faster processing time. © 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University.
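The abstract reports results as PER and WER but does not spell out how they are computed. The sketch below assumes the common definitions: PER as the total Levenshtein (edit) distance between predicted and reference phoneme sequences divided by the total number of reference phonemes, and WER as the fraction of words whose predicted pronunciation is not an exact match. The function names and the two sample words are illustrative only and are not taken from the paper or its dataset.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference phonemes to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Assumed definition: total edit distance / total reference phonemes."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def word_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Assumed definition: fraction of words with any phoneme error."""
    wrong = sum(1 for r, h in zip(refs, hyps) if list(r) != list(h))
    return wrong / len(refs)


# Illustrative data (hypothetical, not from the paper): two words, one
# with a single substituted vowel in the predicted pronunciation.
refs = [["m", "a", "k", "a", "n"], ["b", "ə", "r", "a", "t"]]
hyps = [["m", "a", "k", "a", "n"], ["b", "e", "r", "a", "t"]]

per = phoneme_error_rate(refs, hyps)  # 1 error over 10 reference phonemes
wer = word_error_rate(refs, hyps)     # 1 wrong word out of 2
```

Under these assumed definitions, a single phoneme substitution counts once toward PER but makes the whole word wrong for WER, which is why WER is typically the higher of the two figures.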
Pages: 3807-3814
Page count: 8