AfriWOZ: Corpus for Exploiting Cross-Lingual Transfer for Dialogue Generation in Low-Resource, African Languages

Cited by: 0
Authors
Adewumi, Tosin [1 ,2 ]
Adeyemi, Mofetoluwa [2 ]
Anuoluwapo, Aremu [2 ]
Peters, Bukola [3 ]
Buzaaba, Happy [2 ]
Samuel, Oyerinde [2 ]
Rufai, Amina Mardiyyah [2 ]
Ajibade, Benjamin [2 ]
Gwadabe, Tajudeen [2 ]
Traore, Mory Moussou Koulibaly [2 ]
Ajayi, Tunde Oluwaseyi [2 ]
Muhammad, Shamsuddeen
Baruwa, Ahmed [2 ]
Owoicho, Paul [2 ]
Ogunremi, Tolulope [2 ]
Ngigi, Phylis [4 ]
Ahia, Orevaoghene [2 ]
Nasir, Ruqayya [2 ]
Liwicki, Foteini [1 ]
Liwicki, Marcus [1 ]
Affiliations
[1] Lulea Univ Technol, ML Grp, Lulea, Sweden
[2] Masakhane, Newark, NJ USA
[3] CIS, Washington, DC USA
[4] Jomo Kenyatta Univ Agr & Technol, Juja, Kenya
Keywords
dialogue systems; NLG; low-resource; multilingual; crosslingual;
DOI
10.1109/IJCNN54540.2023.10191208
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda and Yoruba. There are a total of 9,000 turns, each language having 1,500 turns, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we benchmark by investigating and analyzing the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.
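The two evaluation measures named in the abstract, perplexity and majority voting over human judgements, can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The function names, the labels "human" and "bot", the three-annotator setup, and the toy data are all hypothetical, and reading the unanimous share as a fraction of all judged items is an assumption that the abstract does not settle.

```python
import math
from collections import Counter


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def majority_vote(labels):
    """Return (winning_label, is_unanimous) for one item's annotator labels."""
    label, votes = Counter(labels).most_common(1)[0]
    if votes * 2 <= len(labels):
        raise ValueError("no strict majority among annotators")
    return label, votes == len(labels)


# Hypothetical toy inputs: a model that assigns probability 0.25 to each of
# four tokens, and three annotators judging four single-turn conversations.
toy_log_probs = [math.log(0.25)] * 4
toy_annotations = [
    ["human", "human", "human"],
    ["human", "bot", "human"],
    ["bot", "bot", "human"],
    ["human", "human", "human"],
]

toy_ppl = perplexity(toy_log_probs)
votes = [majority_vote(item) for item in toy_annotations]

# Share of items whose majority label is "human".
human_like = sum(label == "human" for label, _ in votes) / len(votes)
# Share of all items judged "human" by every annotator (assumed reading).
unanimous_human = sum(
    label == "human" and unanimous for label, unanimous in votes
) / len(votes)
```

Lower perplexity means the model assigns higher probability to the reference text. An odd number of annotators with two labels guarantees that a strict majority always exists.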
Pages: 8