Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation

Authors
Limkonchotiwat, Peerat [1]
Phatthiyaphaibun, Wannaphong [2]
Sarwar, Raheem [3]
Chuangsuwanich, Ekapol [4]
Nutanong, Sarana [1]
Affiliations
[1] VISTEC, Sch Informat Sci & Technol, Rayong, Thailand
[2] Khon Kaen Univ, Fac Interdisciplinary Studies, Khon Kaen, Thailand
[3] Univ Wolverhampton, RGCL, Wolverhampton, England
[4] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu, since some domains have insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problem. In addition to domain adaptation, we also propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method's generalizability, we also applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.
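The abstract reports scores at both the character level and the word level but does not define them. The sketch below shows one common way these two views of segmentation quality are computed, assuming F1 over word-start boundaries for the character level and F1 over exact word spans for the word level. The function names and the toy Latin-letter input are illustrative assumptions, not taken from the paper, whose exact metric definitions may differ.

```python
def boundaries(words):
    """Per-character labels: 1 if the character starts a word, else 0.
    Assumes every word is a non-empty string."""
    labels = []
    for w in words:
        labels.extend([1] + [0] * (len(w) - 1))
    return labels


def spans(words):
    """Set of (start, end) character offsets, one per word."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def f1(true_pos, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def char_level_f1(gold, pred):
    """F1 over word-start boundary decisions (assumed definition)."""
    g, p = boundaries(gold), boundaries(pred)
    if len(g) != len(p):
        raise ValueError("gold and pred must segment the same text")
    true_pos = sum(1 for a, b in zip(g, p) if a == 1 and b == 1)
    return f1(true_pos, sum(p), sum(g))


def word_level_f1(gold, pred):
    """F1 over words whose start and end offsets both match (assumed definition)."""
    g, p = spans(gold), spans(pred)
    return f1(len(g & p), len(p), len(g))


# Toy example: the same text "abcde" segmented two ways.
gold = ["ab", "cd", "e"]
pred = ["ab", "c", "de"]
print(round(char_level_f1(gold, pred), 3))
print(round(word_level_f1(gold, pred), 3))
```

On this toy input the character-level score is 2/3 and the word-level score is 1/3: a single misplaced boundary costs one boundary decision but spoils two whole words. Under these definitions the word-level score is usually the stricter of the two, which is consistent with the lower word-level figures quoted in the abstract.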
Pages: 1003-1016
Page count: 14