FGraDA: A Dataset and Benchmark for Fine-Grained Domain Adaptation in Machine Translation

Cited by: 0

Authors:

Zhu, Wenhao [1,2]

Huang, Shujian [1,2]

Pu, Tong [1,2]

Huang, Pingxuan [3]

Zhang, Xu [4]

Yu, Jian [4]

Chen, Wei [4]

Wang, Yanfeng [4]

Chen, Jiajun [1,2]

Affiliations:

[1] Nanjing Univ, Natl Key Lab Novel Software Technol, Nanjing, Peoples R China

[2] Collaborat Innovat Ctr Novel Software Technol & I, Nanjing, Peoples R China

[3] Univ Michigan, Ann Arbor, MI 48109 USA

[4] Sogou Inc, Beijing, Peoples R China

Source:

LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION | 2022

Funding:

National Science Foundation (USA); National Key Research and Development Program of China;

Keywords:

Domain Adaptation; Fine-Grained Domains; Machine Translation;

DOI:

Not available

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Previous research on adapting a general neural machine translation (NMT) model to a specific domain usually neglects the diversity of translation within the same domain, which is a core problem for domain adaptation in real-world scenarios. One representative of such challenging scenarios is deploying a translation system for a conference with a specific topic, e.g., global warming or coronavirus, where there are usually extremely few resources due to the limited schedule. To motivate wider investigation of such scenarios, we present a real-world fine-grained domain adaptation task in machine translation (FGraDA). The FGraDA dataset consists of Chinese-English translation tasks for four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones. Each sub-domain is equipped with a development set and a test set for evaluation purposes. To be closer to reality, FGraDA does not employ any in-domain bilingual training data but provides bilingual dictionaries and a wiki knowledge base, which can be obtained more easily within a short time. We benchmark the fine-grained domain adaptation task and present in-depth analyses showing that further improving performance with heterogeneous resources remains challenging.


Pages: 6719-6727

Page count: 9
