Stemmer and phonotactic rules to improve n-gram tagger-based Indonesian phonemicization

Cited by: 0

Authors:

Suyanto, Suyanto [1]

Sunyoto, Andi [2]

Ismail, Rezza Nafi [1]

Rachmawati, Ema [1]

Maharani, Warih [1]

Affiliations:

[1] Telkom Univ, Sch Comp, Bandung, Indonesia

[2] Univ Amikom Yogyakarta, Fac Comp Sci, Yogyakarta, Indonesia

Source:

JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES | 2022 / Vol. 34 / Issue 06

Keywords:

grapheme-to-phoneme conversion; Indonesian language; n-gram; Phonotactic rules; Stemmer; MODEL;

DOI:

10.1016/j.jksuci.2021.01.006

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812;

Abstract:

Phonemicization, or grapheme-to-phoneme conversion (G2P), is the process of converting a word into its pronunciation. It is an essential component of speech synthesis, speech recognition, and natural language processing. State-of-the-art deep learning (DL)-based G2P models generally give a low phoneme error rate (PER) and word error rate (WER) for high-resource languages, such as English and other European languages, but not for low-resource languages. Therefore, some conventional machine learning (ML)-based G2P models that incorporate specific linguistic knowledge are preferable for low-resource languages. However, these models still perform poorly for several low-resource languages because of various issues. For instance, an Indonesian G2P model works well for roots but gives a high PER for derivatives. Most errors come from ambiguities in some roots and in derivative words containing four prefixes: <ber>, <meng>, <peng>, and <ter>. In this research, an Indonesian G2P model based on an n-gram tagger combined with a stemmer and phonotactic rules (NGTSP) is proposed to solve those problems. An evaluation based on 5-fold cross-validation, using 50k Indonesian words, shows that the proposed NGTSP gives a much lower PER of 0.78% than the state-of-the-art Transformer-based G2P model (1.14%). It also provides a much faster processing time. © 2021 The Authors. Published by Elsevier B.V. on behalf of King Saud University.


Pages: 3807-3814

Page count: 8
