Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation

Cited by: 0
Authors
Limkonchotiwat, Peerat [1]
Phatthiyaphaibun, Wannaphong [2]
Sarwar, Raheem [3]
Chuangsuwanich, Ekapol [4]
Nutanong, Sarana [1]
Affiliations
[1] VISTEC, Sch Informat Sci & Technol, Rayong, Thailand
[2] Khon Kaen Univ, Fac Interdisciplinary Studies, Khon Kaen, Thailand
[3] Univ Wolverhampton, RGCL, Wolverhampton, England
[4] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While word segmentation is a solved problem in many languages, it remains a challenge in continuous-script and low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which is a particular problem for low-resource languages such as Thai and Urdu, where some domains have insufficient data. This investigation proposes a new solution for adapting an existing domain-generic model to a target domain, together with a data augmentation technique to combat the low-resource problem. In addition to domain adaptation, we propose a framework for handling out-of-domain inputs using an ensemble of domain-specific models, called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. We also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate the generalizability of our method, we applied the MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.
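The abstract reports scores at both the character level and the word level. The sketch below illustrates one common way such scores are computed for segmentation and why the word-level figure is usually the lower of the two. It is an assumption about the scoring convention, not the paper's actual evaluation code, and the helper names (`boundaries`, `spans`, `char_f1`, `word_f1`) are hypothetical.

```python
# Illustrative sketch only: the scoring convention and all names here are
# assumptions, not taken from the paper. Words are assumed to be non-empty.

def boundaries(words):
    """Per-character labels: 1 where a word starts, 0 elsewhere."""
    labels = []
    for w in words:
        labels.extend([1] + [0] * (len(w) - 1))
    return labels


def spans(words):
    """Set of (start, end) character offsets, one pair per word."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def f1(true_pos, n_pred, n_gold):
    """Harmonic mean of precision and recall; 0.0 if nothing matches."""
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def char_f1(gold, pred):
    """F1 over word-start labels: each correct boundary earns credit."""
    g, p = boundaries(gold), boundaries(pred)
    if len(g) != len(p):
        raise ValueError("gold and pred must segment the same text")
    true_pos = sum(1 for a, b in zip(g, p) if a == 1 and b == 1)
    return f1(true_pos, sum(p), sum(g))


def word_f1(gold, pred):
    """F1 over whole words: a word counts only if both ends are correct."""
    g, p = spans(gold), spans(pred)
    return f1(len(g & p), len(p), len(g))


# Placeholder Latin text stands in for an unsegmented script.
gold = ["ab", "cd", "e"]
pred = ["ab", "c", "de"]
char_score = char_f1(gold, pred)
word_score = word_f1(gold, pred)
```

In this toy case a single misplaced boundary damages two words at once, so the word-level score falls well below the character-level score. This matches the pattern in the reported figures, where the word-level numbers are lower than the character-level ones.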
Pages: 1003-1016
Page count: 14