Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation

Cited by: 0
Authors
Limkonchotiwat, Peerat [1]
Phatthiyaphaibun, Wannaphong [2]
Sarwar, Raheem [3]
Chuangsuwanich, Ekapol [4]
Nutanong, Sarana [1]
Affiliations
[1] VISTEC, Sch Informat Sci & Technol, Rayong, Thailand
[2] Khon Kaen Univ, Fac Interdisciplinary Studies, Khon Kaen, Thailand
[3] Univ Wolverhampton, RGCL, Wolverhampton, England
[4] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While word segmentation is a solved problem in many languages, it remains a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which poses a problem for low-resource languages such as Thai and Urdu, where some domains have insufficient data. This investigation proposes a new solution for adapting an existing domain-generic model to a target domain, as well as a data augmentation technique to combat the low-resource problem. In addition to domain adaptation, we propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. Moreover, we propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and improved performance from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method's generalizability, we applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.
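The abstract reports scores at both the character level and the word level. As a rough illustration of why the two can differ, the sketch below scores a segmentation in both ways. It assumes a common convention (character-level F1 over word-start labels, word-level F1 over exactly matching word spans); the abstract does not give the authors' exact metric definitions, so the function names and formulas here are illustrative assumptions, and the toy strings use Latin letters as placeholders for real text.

```python
def boundaries(words):
    """One 0/1 label per character; 1 marks the first character of a word."""
    labels = []
    for w in words:
        labels.extend([1] + [0] * (len(w) - 1))
    return labels


def spans(words):
    """Set of (start, end) character offsets, one pair per word."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def f1(tp, n_pred, n_gold):
    """Harmonic mean of precision (tp / n_pred) and recall (tp / n_gold)."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


def char_level_f1(gold, pred):
    """Assumed definition: F1 over characters labelled as word starts."""
    assert "".join(gold) == "".join(pred), "segmentations must cover the same text"
    g, p = boundaries(gold), boundaries(pred)
    tp = sum(1 for a, b in zip(g, p) if a == 1 and b == 1)
    return f1(tp, sum(p), sum(g))


def word_level_f1(gold, pred):
    """Assumed definition: F1 over words whose start and end both match."""
    assert "".join(gold) == "".join(pred), "segmentations must cover the same text"
    g, p = spans(gold), spans(pred)
    return f1(len(g & p), len(p), len(g))


# Toy example: a single misplaced boundary breaks two of the three words.
gold = ["ab", "cd", "ef"]
pred = ["ab", "cde", "f"]
char_score = char_level_f1(gold, pred)
word_score = word_level_f1(gold, pred)
```

In this toy case one misplaced boundary costs a single character-level label but invalidates two whole words, so the word-level score is the stricter of the two. That is consistent with the word-level figures in the abstract being lower than the character-level ones.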
Pages: 1003-1016
Page count: 14