Stemmer and phonotactic rules to improve n-gram tagger-based Indonesian phonemicization

Cited by: 0
Authors
Suyanto, Suyanto [1]
Sunyoto, Andi [2]
Ismail, Rezza Nafi [1]
Rachmawati, Ema [1]
Maharani, Warih [1]
Affiliations
[1] Telkom Univ, Sch Comp, Bandung, Indonesia
[2] Univ Amikom Yogyakarta, Fac Comp Sci, Yogyakarta, Indonesia
Keywords
grapheme-to-phoneme conversion; Indonesian language; n-gram; phonotactic rules; stemmer; MODEL
DOI
10.1016/j.jksuci.2021.01.006
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Phonemicization, or grapheme-to-phoneme (G2P) conversion, is the process of converting a word into its pronunciation. It is an essential component of speech synthesis, speech recognition, and natural language processing. State-of-the-art deep learning (DL)-based G2P models generally give a low phoneme error rate (PER) and word error rate (WER) for high-resource languages, such as English and other European languages, but not for low-resource languages. Conventional machine learning (ML)-based G2P models that incorporate specific linguistic knowledge are therefore preferable for low-resource languages. However, these models still perform poorly for several low-resource languages because of various issues. For instance, an Indonesian G2P model works well for roots but gives a high PER for derivatives. Most errors come from ambiguities in some roots and in derivative words containing one of four prefixes: <ber>, <meng>, <peng>, and <ter>. This research proposes an Indonesian G2P model based on an n-gram tagger combined with a stemmer and phonotactic rules (NGTSP) to solve these problems. An evaluation based on 5-fold cross-validation on 50k Indonesian words shows that the proposed NGTSP gives a PER of 0.78%, much lower than that of the state-of-the-art Transformer-based G2P model (1.14%). It also provides a much faster processing time. © 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University.
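The abstract reports results as PER and WER without defining them. As a minimal sketch, assuming the common definitions (PER as the total phoneme-level edit distance divided by the number of reference phonemes, WER as the fraction of words whose predicted pronunciation is not an exact match), the two metrics could be computed as follows. The function names and the toy data are hypothetical; the symbols are placeholders, not real Indonesian phonemes, and the paper's exact scoring procedure may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference phonemes (assumed definition)."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def word_error_rate(refs, hyps):
    """Fraction of words with at least one phoneme error (assumed definition)."""
    wrong = sum(1 for r, h in zip(refs, hyps) if r != h)
    return wrong / len(refs)


# Toy data with placeholder symbols: two words, one substitution in the first.
refs = [["a", "b", "c"], ["d", "e"]]
hyps = [["a", "x", "c"], ["d", "e"]]

per = phoneme_error_rate(refs, hyps)  # 1 error over 5 reference phonemes
wer = word_error_rate(refs, hyps)     # 1 wrong word out of 2
print(f"PER = {per:.2%}, WER = {wer:.2%}")
```

Under these definitions a single wrong phoneme makes the whole word count as an error, so WER is typically higher than PER on the same predictions.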
Pages: 3807-3814
Number of pages: 8