Handling Cross- and Out-of-Domain Samples in Thai Word Segmentation

Cited by: 0

Authors:

Limkonchotiwat, Peerat [1]

Phatthiyaphaibun, Wannaphong [2]

Sarwar, Raheem [3]

Chuangsuwanich, Ekapol [4]

Nutanong, Sarana [1]

Affiliations:

[1] VISTEC, Sch Informat Sci & Technol, Rayong, Thailand

[2] Khon Kaen Univ, Fac Interdisciplinary Studies, Khon Kaen, Thailand

[3] Univ Wolverhampton, RGCL, Wolverhampton, England

[4] Chulalongkorn Univ, Dept Comp Engn, Bangkok, Thailand

Source:

FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL-IJCNLP 2021 | 2021

Keywords: none listed

DOI: not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

While word segmentation is a solved problem in many languages, it is still a challenge in continuous-script or low-resource languages. Like other NLP tasks, word segmentation is domain-dependent, which can be a challenge in low-resource languages like Thai and Urdu, since some domains have insufficient data. This investigation proposes a new solution to adapt an existing domain-generic model to a target domain, as well as a data augmentation technique to combat low-resource problems. In addition to domain adaptation, we propose a framework to handle out-of-domain inputs using an ensemble of domain-specific models called Multi-Domain Ensemble (MDE). To assess the effectiveness of the proposed solutions, we conducted extensive experiments on domain adaptation and out-of-domain scenarios. We also propose a multi-task dataset for Thai text processing, including word segmentation. For domain adaptation, we compared our solution to the state-of-the-art Thai word segmentation (TWS) method and obtained improvements from 93.47% to 98.48% at the character level and from 84.03% to 96.75% at the word level. For out-of-domain scenarios, our MDE method significantly outperformed the state-of-the-art TWS and multi-criteria methods. Furthermore, to demonstrate our method's generalizability, we applied our MDE framework to other languages, namely Chinese, Japanese, and Urdu, and obtained improvements similar to those for Thai.


Pages: 1003-1016

Number of pages: 14
