Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems
Cited by: 0
Authors:
Hu, Songbo [1]
Zhou, Han [1]
Hergul, Mete [1]
Gritta, Milan [2]
Zhang, Guchun [2]
Iacobacci, Ignacio [2]
Vulic, Ivan [1]
Korhonen, Anna [1]
Affiliations:
[1] University of Cambridge, Language Technology Lab, Cambridge, England
[2] Huawei Noah's Ark Lab, London, England
DOI:
10.1162/tacl_a_00609
Chinese Library Classification (CLC) number:
TP18 [Artificial intelligence theory];
Subject classification codes:
081104;
0812;
0835;
1405;
Abstract:
Creating high-quality annotated data for task-oriented dialog (ToD) is known to be notoriously difficult, and the challenges are amplified when the goal is to create equitable, culturally adapted, and large-scale ToD datasets for multiple languages. Therefore, the current datasets are still very scarce and suffer from limitations such as translation-based non-native dialogs with translation artefacts, small scale, or lack of cultural adaptation, among others. In this work, we first take stock of the current landscape of multilingual ToD datasets, offering a systematic overview of their properties and limitations. Aiming to reduce all the detected limitations, we then introduce Multi3WOZ, a novel multilingual, multi-domain, multi-parallel ToD dataset. It is large-scale and offers culturally adapted dialogs in 4 languages to enable training and evaluation of multilingual and cross-lingual ToD systems. We describe a complex bottom-up data collection process that yielded the final dataset, and offer the first sets of baseline scores across different ToD-related tasks for future reference, also highlighting its challenging nature.
Pages: 1396-1415
Page count: 20