AfriWOZ: Corpus for Exploiting Cross-Lingual Transfer for Dialogue Generation in Low-Resource, African Languages

Cited by: 0
Authors
Adewumi, Tosin [1 ,2 ]
Adeyemi, Mofetoluwa [2 ]
Anuoluwapo, Aremu [2 ]
Peters, Bukola [3 ]
Buzaaba, Happy [2 ]
Samuel, Oyerinde [2 ]
Rufai, Amina Mardiyyah [2 ]
Ajibade, Benjamin [2 ]
Gwadabe, Tajudeen [2 ]
Traore, Mory Moussou Koulibaly [2 ]
Ajayi, Tunde Oluwaseyi [2 ]
Muhammad, Shamsuddeen
Baruwa, Ahmed [2 ]
Owoicho, Paul [2 ]
Ogunremi, Tolulope [2 ]
Ngigi, Phylis [4 ]
Ahia, Orevaoghene [2 ]
Nasir, Ruqayya [2 ]
Liwicki, Foteini [1 ]
Liwicki, Marcus [1 ]
Affiliations
[1] Lulea Univ Technol, ML Grp, Lulea, Sweden
[2] Masakhane, Newark, NJ USA
[3] CIS, Washington, DC USA
[4] Jomo Kenyatta Univ Agr & Technol, Juja, Kenya
Keywords
dialogue systems; NLG; low-resource; multilingual; crosslingual;
DOI
10.1109/IJCNN54540.2023.10191208
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yoruba. There are a total of 9,000 turns, each language having 1,500 turns, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we benchmark by investigating & analyzing the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.
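The benchmark compares the transferred models against a seq2seq baseline using perplexity. A minimal sketch of that metric follows; the per-token probabilities below are illustrative placeholders, not outputs of the paper's models:

```python
import math

def perplexity(token_log_probs):
    """Perplexity is the exponential of the negative mean
    log-probability the model assigns to each token: lower
    means the model finds the dialogue turn less surprising."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Placeholder per-token probabilities for one generated turn.
log_probs = [math.log(p) for p in [0.25, 0.10, 0.50, 0.20]]
print(round(perplexity(log_probs), 3))  # → 4.472
```

Equivalently, perplexity is the inverse geometric mean of the token probabilities, which is why a uniform guess over a vocabulary of size V yields perplexity V.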
Pages: 8