Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation

Authors
Limkonchotiwat, Peerat [1]
Phatthiyaphaibun, Wannaphong [2]
Sarwar, Raheem [3]
Chuangsuwanich, Ekapol [4]
Nutanong, Sarana [1]
Affiliations
[1] VISTEC, Sch Informat Sci & Technol, Rayong, Thailand
[2] Khon Kaen Univ, Fac Interdisciplinary Studies, Khon Kaen, Thailand
[3] Univ Wolverhampton, RGCL, Wolverhampton, England
[4] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
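Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes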
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While word segmentation is a solved problem in many languages, it remains a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which is a particular challenge in low-resource languages such as Thai and Urdu, where some domains have insufficient data. This investigation proposes a new solution for adapting an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problem. In addition to domain adaptation, we also propose a framework, called Multi-Domain Ensemble (MDE), that handles out-of-domain inputs using an ensemble of domain-specific models. To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method's generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.
Pages: 1003-1016
Page count: 14